Semi-Quantitative Group Testing: A General
Paradigm with Applications in Genotyping

Amin Emad and Olgica Milenkovic 
October 10, 2012 



Abstract 

We propose a novel group testing method, termed semi-quantitative group testing, motivated by a 
class of problems arising in genome screening experiments. Semi-quantitative group testing (SQGT) is a 
(possibly) non-binary pooling scheme that may be viewed as a concatenation of an adder channel and an 
integer-valued quantizer. In its full generality, SQGT can be viewed as a unifying framework for group 
testing, in the sense that most group testing models are special instances of SQGT. For the new general 
testing scheme, we define the notion of SQ-disjunct and SQ-separable codes, generalizations of the 
classic disjunct and separable codes. We describe several combinatorial and probabilistic constructions 
of such codes. While most of these constructions assume that the number of defectives is much
smaller than the total number of test subjects, we also consider the case in which the number of
defectives is unrestricted and may be as large as the total number of subjects. For these codes, we
describe a number of decoding algorithms; in particular, we describe belief propagation decoders for 
sparse SQGT codes. We define the notion of capacity of SQGT and evaluate it for some special choices 
of parameters using information theoretic methods. 

I. Introduction

Group testing (GT) is a general term for a family of pooling schemes designed to identify a
number of subjects with some particular characteristic - called defectives (or positives) - among
a large number of subjects using some experiments (or tests). The idea behind GT is that if the
number of defectives is much smaller than the number of subjects, one can reduce the number
of experiments required for identifying the defectives by testing properly chosen subgroups of
subjects rather than testing each subject individually. In its full generality, GT may be viewed
as the problem of inferring the state of a system from the superposition of the state vectors of
a subset of the system's elements. As such, it has found many applications in communication
theory [1]-[?], signal processing [?]-[?], computer science [8]-[10], and mathematics [11]. Some
examples of these applications include error-correcting coding [?], [12], [13], identifying users
accessing a multiple access channel (MAC) [14], [15], reconstructing sparse signals from
low-dimensional projections [?], [?], and many others.

The group testing literature falls into two partially overlapping categories of problems, based 
on the way the number of defectives is modeled: probabilistic GT and combinatorial GT. In the 
former case, a probability distribution is considered for the number of defectives, and the goal is 
usually to minimize the expected number of tests (see, for example, [16]-[19]).^1 In the latter case,
the number of defectives (or at least an upper bound on the number of defectives) is known in
advance [?].

Another way to distinguish between different GT schemes is through the way the tests are 
designed. In nonadaptive group testing, all the tests are designed in advance.^2 In other words, the
tests are designed in one pass, and the outcome of a test does not affect the design of the other
tests. On the other hand, in sequential (adaptive) group testing, the result of one test may be
used to govern the design of other tests, leading to more efficient pooling schemes (see [?] and
references therein). Although, in general, sequential GT requires fewer tests, in most practical 
applications nonadaptive GT is preferred, since one is able to perform all tests simultaneously. 
This reduces the time and labor required for testing. In what follows, we focus on combinatorial, 
nonadaptive GT. 

1 In some papers, "probabilistic group testing" refers to a probabilistic construction of tests in a combinatorial GT model. In
this paper, we refer to such constructions as "probabilistic constructions" as opposed to "explicit constructions".

2 The design of a single test refers to selecting the subjects that are present in that test.

Many different models have been considered for combinatorial GT. In the original setting
described by Dorfman [?] (henceforth, conventional GT or CGT), the result of a test indicates
whether there exists at least one defective in the test (hence, the test output equals 0 if there are no
defectives in the test, and 1 otherwise). Another important model is the additive model [7], also
known as quantitative GT (QGT). In this model, the result of a test equals the exact number
of defectives in that test. In the threshold group testing (TGT) model [20], if the number of
defectives in a test is smaller than a fixed lower threshold, the test outcome is negative (or equal
to 0); if the number of defectives is larger than a fixed upper threshold, the test outcome is positive
(or equal to 1); and if the number of defectives is between the lower and upper thresholds, the
test result is arbitrary (either equal to 0 or 1). The difference between the thresholds specifying
arbitrary test results is called the gap. In yet another model introduced in [21], a threshold is fixed
beforehand and the test output corresponds to an additive model output whenever the number of
defectives does not exceed the threshold. If the number of defectives exceeds the threshold, the
output of the test is some value outside the range of the sub-thresholded additive model output.

In all these models, each subject is assigned a unique binary vector (or codeword) of length
equal to the total number of tests. For a given subject, each coordinate of this vector corresponds
to a test and is equal to 1 if the subject is present in the test, and is equal to 0 otherwise. Since in
nonadaptive GT all the tests are designed in parallel, it is convenient to group all the codewords
into a matrix (or code) termed the test matrix (test code). The test matrix is a binary matrix of
size $m \times n$, where $m$ is the number of tests and $n$ is the number of subjects. The design of efficient
test matrices (or GT codes) has been a topic of interest for many years, and a variety of test codes
have been designed for different models and under different assumptions (for a comprehensive
survey of such codes see [?], [22], and [23]). The two main families of test codes were originally
designed for CGT by Kautz and Singleton [24]. The first family is known as disjunct codes
(or zero-false-drop codes), while the second family is usually referred to as separable codes (or
uniquely decipherable codes). Disjunct codes satisfy an inclusion constraint: a $d$-disjunct code
has the property that no codeword is included in (or is covered by) the component-wise Boolean
OR of any $d$ or fewer other codewords. This property enables disjunct codes to
uniquely identify up to $d$ defectives and also endows them with an efficient decoding algorithm.
Separability is a weaker notion than disjunctness, as it only requires the component-wise Boolean
ORs of any two distinct sets of (up to) $d$ codewords to be different.
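
The covering property and the decoder it enables can be illustrated with a minimal Python sketch. The function names (`boolean_or`, `is_covered`, `is_d_disjunct`, `cover_decode`) and the toy identity-matrix code are hypothetical choices made for illustration only; the disjunctness check is a brute-force search and is not meant as an efficient test.

```python
from itertools import combinations


def boolean_or(columns):
    """Component-wise Boolean OR of a non-empty collection of binary codewords."""
    return tuple(int(any(bits)) for bits in zip(*columns))


def is_covered(codeword, cover):
    """True if every 1 of `codeword` is also a 1 of `cover`."""
    return all(c <= v for c, v in zip(codeword, cover))


def is_d_disjunct(columns, d):
    """Brute-force check: no codeword is covered by the OR of up to d others."""
    for i, x in enumerate(columns):
        others = columns[:i] + columns[i + 1:]
        for size in range(1, min(d, len(others)) + 1):
            for subset in combinations(others, size):
                if is_covered(x, boolean_or(subset)):
                    return False
    return True


def cover_decode(columns, y):
    """Declare subject i defective iff its codeword is covered by the CGT result y."""
    return [i for i, x in enumerate(columns) if is_covered(x, y)]


# Toy code (hypothetical): the columns of the 4 x 4 identity matrix.
identity = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
y = boolean_or([identity[0], identity[2]])  # CGT outcome when subjects 0 and 2 are defective
print(is_d_disjunct(identity, 2))
print(cover_decode(identity, y))
```

The identity matrix is disjunct but uses one test per subject, so it only illustrates the definitions; useful disjunct codes have far fewer rows than columns.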

Despite the significant interest the subject has garnered in computer science, coding and
combinatorial theory, and despite the analysis of many diverse extensions of the underlying
problem, group testing has still not seen widespread use in medical sciences and biology. Two
notable exceptions were the early use of group testing for DNA sequence analysis [22] and the
very recent work on group testing for genotyping and biosensing [25]-[27]. The reason behind
this practical failure of group testing in life sciences is that most analytical models do not capture
the full complexity of bioengineering systems. Model simplifications are necessarily introduced
in order to derive closed-form expressions for the smallest number of tests required to perform the
experiments or to guarantee code constructions with provable performance guarantees, thereby
neglecting the fact that in practical applications such simplifications do not lead to operational
systems. For example, one would be inclined to accept a number of tests higher than that
predicted to be theoretically optimal for a coarse model if there is evidence that the scheme is
suited to the given system constraints.

This paper is the first step in developing a novel framework for group testing that caters to
the unique needs of the emerging field of genotyping through high-throughput sequencing,^3 as
motivated below.

A. Challenges in Genotyping, and Semi-quantitative Group Testing 

Genotyping is an emerging field in systems biology concerned with determining genetic 
variations in the traits of individuals. At the core of every genotyping method is DNA sequencing 
- determining the genetic blueprint of an individual - and a comparative analysis of the sequences 
obtained for different individuals. Comparative studies of the DNA makeup play an indispensable 
role in medical genetics, the goals of which are to efficiently determine "outliers" in genetic 
codes that may lead to devastating disorders or mental illness [25].

One of the most important applications of genotyping is detecting the carriers of a particular
genetic disorder. Since the human genome consists of pairs of chromosomes, and each chromosome
contains genes with matching functionalities, a human who has inherited a mutated
gene may not display the symptoms of the genetic disease. In this situation, the individual has
a normal (unmutated) dominant copy of the same gene, which prevents the disease from being
expressed. Although the carrier does not display disease symptoms, the offspring of two carriers
may display the disease; as a result, identifying the carriers of a disease is extremely important.
While affected individuals can be diagnosed based on their symptoms, a carrier can only be
diagnosed by DNA screening.

3 Although this work was motivated by applications in genotyping, the model, results, and code constructions are applicable
to a wide variety of applications in biology, communication theory, signal processing, etc.

In the screening process of genotyping, one targets genomic regions known to harbor genetic 
mutations. Until recently, only serial sequencing of the genome of one individual was possible; 
however, the introduction of the new class of genome sequencing methods dubbed
next-generation sequencing technologies [28] enabled parallel sequencing of the genome. These
platforms break the genomic region of interest into short fragments and perform millions of 
sequence reads in a single run (for an example of such platforms see Illumina [29]). Due to the 
high cost of sample preparation for sequencing, and, in order to fully utilize the potential of the 
sequencing platforms, multiplexing a large number of specimens in a single batch is essential. 
As a result, group testing presents itself as a natural paradigm to address these challenges, and 
the first steps in this direction were taken in [30], [?], [25], [26]. Despite the promising results
of applying the existing group testing models to genotyping, many practical problems still stand 
in the way of the wide-scale use of this method. 

One such problem arises from the fact that genotyping methods allow for more precise readings 
at the output than classical GT detectors, but still do not provide full information about the 
abundance of a target gene in the test. As a result, codes constructed for CGT or TGT underutilize 
the potential of these sequencers, while codes constructed for QGT are prone to significant errors 
due to "overestimating" the sequencers' precision. Specifically, since the precision of a sequencer, 
in general, depends on the number of defectives (i.e. target genes) and the amount of genetic 
material in the test, the error is signal/design dependent and cannot be modeled easily. In order to 
overcome this problem, in what follows we propose a new framework called semi-quantitative 
group testing (SQGT). Other problems arising in the context of genotyping - such as copy
number variation [?]-[34], probabilistic modeling of family trees within the GT framework, as
well as multiple gene mutation disorder screening (and the resulting notion of two-dimensional
group testing) - will be discussed elsewhere.

In SQGT, the result of a test is a non-binary value that depends on the number of defectives 
through a given set of thresholds. The thresholds depend on the sequencer and represent its 
precision. The SQGT paradigm may be viewed as a combination of the adder model (QGT)
and a decimator (quantizer). Although QGT has been widely studied in the literature, the addition
of a system-dependent decimator makes test construction and analysis quite challenging. It is
worth emphasizing that the application of the SQGT model is not limited to genotyping, and in 
general any scheme in which test results are obtained using a test device with limited precision 
may be modeled as an instance of SQGT. In particular, CGT, TGT (with zero gaps), and QGT 
are all special cases of SQGT. 

We also allow for the possibility of having different amounts of sample material for different
test subjects, which results in non-binary test matrices. Although binary testing is required for
some applications - such as coin weighing - in other applications, such as conflict resolution in
a multiple access channel (MAC) and genotyping, non-binary tests may be used to further reduce
the number of tests. In the former example, different non-binary values in a test correspond
to different power levels of the users, while in the latter example, they correspond to different
amounts of genetic material of different subjects. Non-binary tests are especially
important in applications like genotyping, where sample preparation is so expensive that
one may be inclined to reduce the number of tests at the expense of extracting more genetic
material. While there exist information theoretic approaches applicable to the study of non-binary
test matrices [23, Ch. 6], to the best of the authors' knowledge, the only attempts at non-binary
code construction relevant to group testing are limited to a handful of papers, including [35]
and [36], where constructions are considered for an adder MAC channel (i.e. QGT).

For the new and versatile model of SQGT with $Q$-ary test results and $q$-ary test sample
sizes, $Q, q \ge 2$, we define a new generalization of disjunct and separable codes, called
"SQ-disjunct" and "SQ-separable" codes, respectively. Probabilistic constructions as well as explicit
constructions are provided for these two families of codes when the number of defectives is much
smaller than the total number of subjects. In addition, the important special case of SQGT with
equidistant thresholds is discussed in detail, and test constructions are provided for this model
as well. Furthermore, a generalization of the well-known Lindström construction for QGT [37]
is described, capable of identifying any number of defectives (even as large as the total number
of subjects). Based on this new construction, a SQ-separable code with equidistant thresholds
is constructed for SQGT to identify defectives which are not necessarily sparsely present. All
our derivations have an information theoretic underpinning and are centered around the notion
of capacity of SQGT, which we study in relation to the minimal number of tests required to
identify defectives.

The paper is organized as follows. Section II describes the SQGT model. In Section III,
we define SQ-disjunct and SQ-separable codes and present some properties of these codes. In
Section IV, we describe a number of constructions for SQGT codes. Belief propagation decoders
for probabilistic constructions of SQGT codes are described in Section V, while Section VI includes
the information theoretic bounds and the capacity of SQGT. Finally, Section VII concludes the
paper.

II. Semi-quantitative Group Testing: The Model

Throughout this paper, we adopt the following notation. Bold-face upper-case and bold-face
lower-case letters denote matrices and vectors, respectively. Calligraphic letters are used to denote
sets. Asymptotic symbols such as $\sim$, $o(\cdot)$, and $O(\cdot)$ are used in a standard manner. For a positive
integer $k$, we define $[k] := \{0, 1, \dots, k-1\}$ and $\llbracket k \rrbracket := \{1, 2, \dots, k\}$. For simplicity, we sometimes
use $\mathcal{X} = \{\mathbf{x}_j\}_1^s$ to denote a set of $s$ codewords $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_s\}$.

Let $n$, $m$, and $d$ denote the number of test subjects, the number of tests, and the number of
defectives, respectively. Let $S_i$ denote the $i$-th subject, $i \in \llbracket n \rrbracket$, and let $S_{i_j} = D_j$ be the $j$-th defective,
$j \in \llbracket d \rrbracket$. Furthermore, let $\mathcal{D}$ denote the set of defectives, so that $|\mathcal{D}| = d$. Let $\mathbf{w} \in [2]^n$ be a binary
vector with its $i$-th coordinate equal to 1 if the $i$-th subject is defective, and 0 otherwise.

We assign to each subject a unique $q$-ary vector of length $m$, termed the codeword of the
subject. Each coordinate of the codeword corresponds to a test. If $\mathbf{x}_i \in [q]^m$ denotes the codeword
of the $i$-th subject, then the $k$-th coordinate of $\mathbf{x}_i$, denoted by $x_i(k)$, may be viewed as the "amount"
of $S_i$ (i.e. sample size, concentration, etc.) used in the $k$-th test. Note that the symbol 0 indicates
that $S_i$ is not present in the test. We denote the test matrix, or equivalently, the code, by
$\mathbf{C} \in [q]^{m \times n}$. The goal is to construct a code such that the defectives can be uniquely identified in an
SQGT model. Table I summarizes these symbols and their meanings.

The result of each test in SQGT is an integer from the set $[Q]$. Each test outcome depends
on the number of defectives and their sample amounts in the test through $Q$ thresholds, $\eta_l$
($l \in \llbracket Q \rrbracket$). In order to simplify the relationship between the test results and the codewords
assigned to the defectives, we use the following definition.

Definition 1: The "SQ-sum" of a set of $s \ge 1$ codewords, $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_s\} = \{\mathbf{x}_j\}_1^s$, in a
SQGT model with thresholds $\boldsymbol{\eta} = [\eta_0 = 0, \eta_1, \eta_2, \dots, \eta_Q]^T$ is represented by
$\mathbf{y}_{\mathcal{X}} = \bigoplus_{j=1}^{s} \mathbf{x}_j = \mathbf{x}_1 \oplus \mathbf{x}_2 \oplus \cdots \oplus \mathbf{x}_s$, and describes a vector of length $m$ with its $k$-th coordinate equal to

$$y_{\mathcal{X}}(k) = r \quad \text{if} \quad \eta_r \le \sum_{j=1}^{s} x_j(k) < \eta_{r+1}, \qquad 0 \le r < Q,$$

where $x_j(k)$ is the $k$-th coordinate of $\mathbf{x}_j$, and "+" stands for real-valued addition. We call
$\mathbf{y}_{\mathcal{X}} \in [Q]^m$ the syndrome of $\mathcal{X}$.

TABLE I: Table of symbols and their definitions

Symbol                              Definition
$n$                                 Total number of subjects
$m$                                 Number of tests
$d$                                 Number of defectives
$Q$                                 Size of the output alphabet
$q$                                 Size of the test matrix alphabet
$\eta_l$                            The $l$-th threshold, where $l \in \llbracket Q \rrbracket$
$\mathcal{D}$                       Set of defectives
$\mathbf{w} \in [2]^n$              Indicator vector of defectives
$\mathbf{y} \in [Q]^m$              Vector of test results
$\mathbf{x}_i \in [q]^m$            Codeword assigned to the $i$-th subject
$\mathbf{C} \in [q]^{m \times n}$   Code (test matrix)
$e$                                 Number of errors in $\mathbf{y}$ that $\mathbf{C}$ can correct

Using this definition, the vector of test results for a SQGT model can be described as

$$\mathbf{y} = \bigoplus_{j=1}^{d} \mathbf{x}_{i_j},$$

where $\mathbf{x}_{i_j}$ is the codeword of the $j$-th defective. This equation implies that the result of the $k$-th test
depends on the sum of the $k$-th coordinates of the defectives' codewords, $\sum_{j=1}^{d} x_{i_j}(k)$, as shown
in Fig. 1. Fig. 2 illustrates an example of a SQGT code, the indicator vector of the defectives, and the vector of
test results, where $d = 3$, $m = 5$, $n = 10$, $q = 3$, $Q = 4$, and $\boldsymbol{\eta} = [0, 2, 3, 5, 7]^T$.
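
As a rough illustration of Definition 1, the SQ-sum can be computed with a short Python sketch. The function name `sq_sum` and the three codewords below are hypothetical (they are not the code of Fig. 2); only the parameters $q = 3$, $Q = 4$, $d = 3$, and $\boldsymbol{\eta} = [0, 2, 3, 5, 7]^T$ are borrowed from the example above. The sketch assumes strictly increasing thresholds.

```python
from bisect import bisect_right


def sq_sum(codewords, thresholds):
    """Syndrome of a set of codewords for thresholds [eta_0 = 0, eta_1, ..., eta_Q].

    Coordinate k of the result is the r with eta_r <= sum_j x_j(k) < eta_{r+1}.
    """
    sums = [sum(column) for column in zip(*codewords)]
    if max(sums) >= thresholds[-1]:
        raise ValueError("a coordinate sum reaches eta_Q; the model assumes eta_Q > (q - 1) d")
    # bisect_right counts the thresholds that are <= s, so subtracting 1 gives r.
    return tuple(bisect_right(thresholds, s) - 1 for s in sums)


thresholds = [0, 2, 3, 5, 7]          # eta from the example above (Q = 4)
x1 = (2, 0, 1, 2, 0)                  # hypothetical ternary codewords (q = 3, m = 5)
x2 = (1, 2, 0, 2, 1)
x3 = (0, 2, 1, 2, 0)

# Coordinate sums are (3, 4, 2, 6, 1), which quantize to (2, 2, 1, 3, 0).
print(sq_sum([x1, x2, x3], thresholds))
```

The explicit check against $\eta_Q$ mirrors the model's assumption that the largest threshold exceeds every achievable coordinate sum.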

Based on the definition, it is clear that SQGT may be viewed as a concatenation of an adder
channel and a decimator (quantizer). Also, if $q = Q = 2$ and $\eta_1 = 1$, the SQGT model reduces
to CGT. Furthermore, if $Q - 1 = d(q - 1)$ and $\forall r \in [Q]$, $\eta_r = r$, then SQGT reduces to the
adder model (QGT), with a possibly non-binary test matrix. Similarly, TGT (with zero gap) and
the model in [21] are also special instances of SQGT. Fig. 3 demonstrates these models
for $q = 2$. Note that in the SQGT model, we assume that $\eta_Q > (q - 1)d$. Of special interest
is a SQGT model with a uniform quantizer - i.e. SQGT with equidistant thresholds. In this
case, $\eta_r = r\eta$, where $r \in [Q + 1]$, and the SQ-sum of $s$ codewords, $\mathbf{y}_{\mathcal{X}} = \bigoplus_{j=1}^{s} \mathbf{x}_j$, simplifies to

$$y_{\mathcal{X}}(k) = \left\lfloor \frac{x_1(k) + x_2(k) + \cdots + x_s(k)}{\eta} \right\rfloor,$$

where $\lfloor \cdot \rfloor$ denotes the floor function. We discuss code constructions
for the uniform model in more detail in the next sections.

Fig. 1: The outcome $y(k)$ of the $k$-th test and its relationship with $\sum_{j=1}^{d} x_{i_j}(k)$ through the thresholds
in a SQGT model with (possibly) non-binary test design: $y(k) = 0$ for sums in $\{0, 1, \dots, \eta_1 - 1\}$,
$y(k) = 1$ for sums in $\{\eta_1, \dots, \eta_2 - 1\}$, and so on up to $y(k) = Q - 1$ for sums in $\{\eta_{Q-1}, \dots, \eta_Q - 1\}$.

Fig. 2: A test matrix $\mathbf{C}$, indicator vector of defectives $\mathbf{w}$, and the corresponding vector of
test results $\mathbf{y}$, for an example SQGT scheme with $d = 3$, $m = 5$, $n = 10$, $q = 3$, $Q = 4$, and
$\boldsymbol{\eta} = [0, 2, 3, 5, 7]^T$.
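
The special cases named in this section can be checked numerically with a small sketch. The helper `sq_sum` and all codewords below are hypothetical; in each case the last threshold $\eta_Q$ is simply chosen larger than $(q-1)d$, as the model requires.

```python
from bisect import bisect_right


def sq_sum(codewords, thresholds):
    """Coordinate k of the result is the r with eta_r <= sum_j x_j(k) < eta_{r+1}."""
    sums = [sum(column) for column in zip(*codewords)]
    if max(sums) >= thresholds[-1]:
        raise ValueError("a coordinate sum reaches eta_Q")
    return tuple(bisect_right(thresholds, s) - 1 for s in sums)


d = 2
x1, x2 = (1, 0, 1), (0, 0, 1)         # hypothetical binary codewords (q = 2)

# CGT: q = Q = 2 and eta_1 = 1, so the syndrome is the Boolean OR.
cgt = sq_sum([x1, x2], [0, 1, 3])
boolean_or = tuple(a | b for a, b in zip(x1, x2))

# QGT: eta_r = r with Q - 1 = d (q - 1) = 2, so the syndrome is the plain sum.
qgt = sq_sum([x1, x2], [0, 1, 2, 3])
plain_sum = tuple(a + b for a, b in zip(x1, x2))

# Equidistant thresholds eta_r = r * eta: the syndrome is a floor division.
eta, Q = 2, 3
z1, z2 = (2, 1, 0), (2, 2, 1)         # hypothetical ternary codewords (q = 3)
uniform = sq_sum([z1, z2], [r * eta for r in range(Q + 1)])
floored = tuple((a + b) // eta for a, b in zip(z1, z2))

print(cgt == boolean_or, qgt == plain_sum, uniform == floored)
```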

III. Generalized Disjunct and Separable Codes for SQGT 

In what follows, we introduce two new families of codes suitable for SQGT, dubbed SQ- 
disjunct and SQ-separable. These codes can be considered as generalizations of binary disjunct 
and binary separable (uniquely decodable) codes introduced in [24] for efficient zero-error
identification of defectives in CGT. It is worth mentioning that the SQ-disjunct codes, similar to
their CGT counterparts, benefit from a simple decoding algorithm with complexity $O(mn)$.
For both of these codes, we use a set of parameters as explained below. 

A $[q; Q; \boldsymbol{\eta}; (l:u); e]$-SQ-disjunct/separable code is a $q$-ary code for a SQGT model with
thresholds $\boldsymbol{\eta} = [0, \eta_1, \eta_2, \dots, \eta_Q]^T$. Such a code is capable of uniquely identifying any number of
defectives between $l$ and $u$ (i.e. the number of defectives is at least $l$ and at most $u$), from a
$Q$-ary vector of test results containing up to $e$ erroneous test results. For simplicity, when the
code can only identify exactly $d$ defectives (i.e. $l = u = d$), we use $d$ instead of $(l:u)$. Also, in
the case of equidistant SQGT, we use the scalar $\eta$ instead of the vector $\boldsymbol{\eta}$.

Fig. 3: Different group testing models for the case where $q = 2$, including (c) TGT with zero gap and
(d) the model in [21]. In these figures, $\eta_T$ denotes the
threshold in TGT and $\eta_{DR}$ denotes the threshold in the model described in [21].

A. SQ-disjunct codes 

In what follows, we define a new family of disjunct codes for SQGT that shares many of the 
properties of binary disjunct codes. We start by providing the following definitions. 

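Definition 2: A set of codewords $\mathcal{X} = \{\mathbf{x}_j\}_1^s$ with syndrome $\mathbf{y}_{\mathcal{X}}$ is said to be included in another
set of codewords $\mathcal{Z} = \{\mathbf{z}_j\}_1^t$ with syndrome $\mathbf{y}_{\mathcal{Z}}$, if $\forall i \in \llbracket m \rrbracket$, $y_{\mathcal{X}}(i) \le y_{\mathcal{Z}}(i)$. We denote this
inclusion property by $\mathcal{X} \lhd \mathcal{Z}$, or equivalently, $\mathbf{y}_{\mathcal{X}} \lhd \mathbf{y}_{\mathcal{Z}}$.

Remark 1: By this definition, it can be easily verified that if $\mathcal{X} \subseteq \mathcal{Z}$, then $\mathcal{X} \lhd \mathcal{Z}$.

Note that for $q = Q = 2$ and $\eta_1 = 1$, this definition is equivalent to the definition of inclusion
for disjunct codes in CGT defined in [24]. Using the notion of inclusion, we may now define
SQ-disjunct codes when $e = 0$ (error-free scenario).

Definition 3: A code is called a $[q; Q; \boldsymbol{\eta}; (1:d); 0]$-SQ-disjunct code of length $m$ and size $n$ if
$\forall s, t \le d$ and for any sets of $q$-ary codewords $\mathcal{X} = \{\mathbf{x}_j\}_1^s$ and $\mathcal{Z} = \{\mathbf{z}_j\}_1^t$, $\mathcal{X} \lhd \mathcal{Z}$ implies $\mathcal{X} \subseteq \mathcal{Z}$.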

Theorem 1: A $[q; Q; \boldsymbol{\eta}; (1:d); 0]$-SQ-disjunct code is capable of identifying any number of
defectives less than or equal to $d$ in the absence of test errors. In other words, given an
error-free vector of test results $\mathbf{y} \in [Q]^m$, any codeword with a syndrome included in $\mathbf{y}$ corresponds to
a defective, and any codeword with a syndrome not included in $\mathbf{y}$ corresponds to a non-defective
(i.e. negative subject).

Proof: Let $\mathbf{x}_i$, $i \in \llbracket n \rrbracket$, be a codeword of a $[q; Q; \boldsymbol{\eta}; (1:d); 0]$-SQ-disjunct code. Since
$\mathbf{y} = \bigoplus_{j} \mathbf{x}_{i_j}$, where the SQ-sum runs over the defectives $S_{i_j} \in \mathcal{D}$, if $i$ corresponds to a defective (i.e. $S_i \in \mathcal{D}$), we have $\mathbf{y}_{\{\mathbf{x}_i\}} \lhd \mathbf{y}$.
Conversely, by Definition 3 it can be easily verified that if $S_i \notin \mathcal{D}$ and $|\mathcal{D}| \le d$, then $\mathbf{y}_{\{\mathbf{x}_i\}} \not\lhd \mathbf{y}$. ■
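
The decoding rule of Theorem 1 can be sketched in a few lines of Python. The names `sq_sum` and `decode_sq_disjunct`, as well as the toy code (the 4 x 4 identity matrix scaled by $q - 1 = 2$ with thresholds $[0, 2, 5]$), are assumptions made for illustration. Each codeword is compared with $\mathbf{y}$ coordinate by coordinate, which gives the $O(mn)$ complexity mentioned earlier.

```python
from bisect import bisect_right


def sq_sum(codewords, thresholds):
    """Coordinate k of the result is the r with eta_r <= sum_j x_j(k) < eta_{r+1}."""
    sums = [sum(column) for column in zip(*codewords)]
    if max(sums) >= thresholds[-1]:
        raise ValueError("a coordinate sum reaches eta_Q")
    return tuple(bisect_right(thresholds, s) - 1 for s in sums)


def decode_sq_disjunct(columns, y, thresholds):
    """Declare subject i defective iff the syndrome of {x_i} is included in y."""
    defectives = []
    for i, x in enumerate(columns):
        single = sq_sum([x], thresholds)
        if all(a <= b for a, b in zip(single, y)):
            defectives.append(i)
    return defectives


# Toy code (hypothetical): q = 3, Q = 2, d = 2, thresholds [0, 2, 5].
thresholds = [0, 2, 5]
columns = [(2, 0, 0, 0), (0, 2, 0, 0), (0, 0, 2, 0), (0, 0, 0, 2)]

y = sq_sum([columns[1], columns[3]], thresholds)   # subjects 1 and 3 are defective
print(decode_sq_disjunct(columns, y, thresholds))
```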
We also prove the following useful result. 

Theorem 2: A code is $[q; Q; \boldsymbol{\eta}; (1:d); 0]$-SQ-disjunct if and only if no codeword is included in
a set of $d$ other codewords.

Proof: It is easy to verify that if a code is $[q; Q; \boldsymbol{\eta}; (1:d); 0]$-SQ-disjunct, then no codeword
is included in a set of $d$ other codewords.

Conversely, let $\mathcal{X} = \{\mathbf{x}_j\}_1^s$ and $\mathcal{Z} = \{\mathbf{z}_j\}_1^t$ be two sets of codewords where $s, t \le d$. From the
assumption that no codeword is included in a set of $d$ other codewords, one can conclude that
no codeword is included in a set of $t$ other codewords whenever $t \le d$. If $\mathcal{X} \lhd \mathcal{Z}$ but $\mathcal{X} \not\subseteq \mathcal{Z}$,
then there exists a codeword $\mathbf{x}_j \in \mathcal{X}$, $j \in \llbracket s \rrbracket$, such that $\{\mathbf{x}_j\} \not\subseteq \mathcal{Z}$. But since $\{\mathbf{x}_j\} \lhd \mathcal{X} \lhd \mathcal{Z}$, then
$\{\mathbf{x}_j\} \lhd \mathcal{Z}$, which contradicts the assumption that no codeword is included in $t$ other codewords. ■

Remark 2: From Theorem 2, one can conclude that a code is $[q; Q; \boldsymbol{\eta}; (1:d); 0]$-SQ-disjunct if
and only if for any set of $d + 1$ codewords, $\mathcal{X} = \{\mathbf{x}_j\}_1^{d+1}$, and for any codeword $\mathbf{x}_i \in \mathcal{X}$, there
exists at least one "unique coordinate" $k_i$ for which

$$y_{\{\mathbf{x}_i\}}(k_i) > y_{\mathcal{X} \setminus \{\mathbf{x}_i\}}(k_i), \qquad (1)$$

where $\mathbf{y}_{\{\mathbf{x}_i\}}$ is the syndrome of $\{\mathbf{x}_i\}$, and $\mathbf{y}_{\mathcal{X} \setminus \{\mathbf{x}_i\}}$ is the syndrome of the other $d$ codewords in
$\mathcal{X}$. By unique coordinate, we mean that for any $i, j \in \llbracket d + 1 \rrbracket$, $k_i \ne k_j$ if $i \ne j$. Note that for
equidistant SQGT, (1) simplifies to

$$\left\lfloor \frac{x_i(k_i)}{\eta} \right\rfloor > \left\lfloor \frac{\sum_{j=1, j \ne i}^{d+1} x_j(k_i)}{\eta} \right\rfloor.$$

Using the notion of unique coordinate, we can generalize Definition 3 to SQ-disjunct codes
that are capable of correcting up to $e > 0$ errors.

Definition 4: A code is called a $[q; Q; \boldsymbol{\eta}; (1:d); e]$-SQ-disjunct code of length $m$ and size $n$ if
for any set of $d + 1$ codewords, $\mathcal{X} = \{\mathbf{x}_j\}_1^{d+1}$, and for any codeword $\mathbf{x}_i \in \mathcal{X}$, there exists a set of
unique coordinates, $\mathcal{R}_i$, of size at least $2e + 1$ such that $\forall k_i \in \mathcal{R}_i$,

$$y_{\{\mathbf{x}_i\}}(k_i) > y_{\mathcal{X} \setminus \{\mathbf{x}_i\}}(k_i), \qquad (2)$$

where $\mathbf{y}_{\{\mathbf{x}_i\}}$ is the syndrome of $\{\mathbf{x}_i\}$, and $\mathbf{y}_{\mathcal{X} \setminus \{\mathbf{x}_i\}}$ is the syndrome of the $d$ remaining codewords
in $\mathcal{X}$.

Such a code is capable of uniquely identifying up to $d$ defectives, in the presence of up to
$e$ errors in the vector of test results. If a codeword $\mathbf{x}_i$ does not correspond to a defective, its
syndrome contains at least $e + 1$ coordinates satisfying $y_{\{\mathbf{x}_i\}}(k) > y(k)$. On the other hand, if $\mathbf{x}_i$
corresponds to a defective, its syndrome contains at most $e$ coordinates satisfying $y_{\{\mathbf{x}_i\}}(k) > y(k)$.
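
The error-tolerant decoding rule described above (count the coordinates where the codeword's syndrome exceeds the observed result, and compare the count with $e$) can be sketched as follows. The names `sq_sum` and `decode_with_errors` are hypothetical, and the toy code is an assumption: a 3 x 3 identity matrix scaled by $q - 1 = 2$ and repeated three times, so that every codeword has $2e + 1 = 3$ unique coordinates for $e = 1$.

```python
from bisect import bisect_right


def sq_sum(codewords, thresholds):
    """Coordinate k of the result is the r with eta_r <= sum_j x_j(k) < eta_{r+1}."""
    sums = [sum(column) for column in zip(*codewords)]
    if max(sums) >= thresholds[-1]:
        raise ValueError("a coordinate sum reaches eta_Q")
    return tuple(bisect_right(thresholds, s) - 1 for s in sums)


def decode_with_errors(columns, y, thresholds, e):
    """Declare subject i defective iff at most e coordinates have y_{x_i}(k) > y(k)."""
    defectives = []
    for i, x in enumerate(columns):
        single = sq_sum([x], thresholds)
        violations = sum(1 for a, b in zip(single, y) if a > b)
        if violations <= e:
            defectives.append(i)
    return defectives


# Toy code (hypothetical): q = 3, Q = 2, d = 2, e = 1, m = 9, n = 3.
thresholds = [0, 2, 5]
base = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
columns = [tuple(2 * b for b in col) * 3 for col in base]

y_clean = sq_sum([columns[0], columns[2]], thresholds)  # subjects 0 and 2 are defective
y_noisy = list(y_clean)
y_noisy[0] = 0                                          # a single erroneous test result
print(decode_with_errors(columns, tuple(y_noisy), thresholds, e=1))
```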

Remark 3: It can be easily seen from (1) and (2) that a necessary condition for the existence
of a $[q; Q; \boldsymbol{\eta}; (1:d); e]$-SQ-disjunct code is that $q - 1 \ge \eta_1$. As a result, there exist no binary
$[2; Q; \boldsymbol{\eta}; (1:d); e]$-SQ-disjunct codes when $\eta_1 > 1$.

B. SQ-separable codes

Although SQ-disjunct codes can be used to find defectives in a SQGT model using a simple 
decoding procedure, the requirements imposed on such codes may appear too restrictive for 
certain applications. As a result, relaxing these structural constraints may lead to codes with 
higher rates. In addition, in some applications there may be some restrictions on the size of the 
alphabet used for designing the code; since SQ-disjunct codes cannot be used for the case where
$q \le \eta_1$, one may be interested in designing codes with smaller alphabet size (possibly binary,
$q = 2$). SQ-separable codes are a family of $q$-ary codes that are capable of overcoming these
issues.

Definition 5 (SQ-separable codes): A code is called a $[q; Q; \boldsymbol{\eta}; (l:u); e]$-SQ-separable code of
length $m$ and size $n$ if for any two distinct sets of codewords $\mathcal{X}$ and $\mathcal{Z}$ that satisfy $l \le |\mathcal{X}|, |\mathcal{Z}| \le u$,
there exists a set of coordinates $\mathcal{R}$, with size $|\mathcal{R}| \ge 2e + 1$, such that $\forall k \in \mathcal{R}$,

$$y_{\mathcal{X}}(k) \ne y_{\mathcal{Z}}(k).$$

Such codes are capable of identifying defectives when the vector of test results contains at
most $e$ errors, given that the number of defectives is at least $l$ and at most $u$.
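
Definition 5 can be checked by brute force on small codes. The following sketch assumes hypothetical names (`sq_sum`, `is_sq_separable`) and a hypothetical binary code with three codewords; it enumerates all admissible subsets, so it is only practical for toy sizes.

```python
from bisect import bisect_right
from itertools import combinations


def sq_sum(codewords, thresholds):
    """Coordinate k of the result is the r with eta_r <= sum_j x_j(k) < eta_{r+1}."""
    sums = [sum(column) for column in zip(*codewords)]
    if max(sums) >= thresholds[-1]:
        raise ValueError("a coordinate sum reaches eta_Q")
    return tuple(bisect_right(thresholds, s) - 1 for s in sums)


def is_sq_separable(columns, thresholds, l, u, e=0):
    """Brute-force check of Definition 5 (assumes 1 <= l <= u <= len(columns))."""
    subsets = [s for size in range(l, u + 1)
               for s in combinations(range(len(columns)), size)]
    syndromes = {s: sq_sum([columns[i] for i in s], thresholds) for s in subsets}
    for a, b in combinations(subsets, 2):
        differing = sum(1 for p, r in zip(syndromes[a], syndromes[b]) if p != r)
        if differing < 2 * e + 1:
            return False
    return True


columns = [(1, 0), (0, 1), (1, 1)]     # hypothetical binary code, m = 2, n = 3

adder = [0, 1, 2, 3]                   # QGT-like thresholds, eta_r = r
cgt = [0, 1, 3]                        # CGT thresholds, eta_1 = 1

print(is_sq_separable(columns, adder, l=2, u=2))   # exactly two defectives, adder output
print(is_sq_separable(columns, cgt, l=2, u=2))     # same code, binary output
```

In this toy example the same test matrix separates all pairs of defectives under the adder thresholds but not under the CGT thresholds, which reflects the motivation for exploiting more precise test outputs.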

Remark 4: From this definition, one can see that a necessary condition for the existence of a
$[q; Q; \boldsymbol{\eta}; (l:u); e]$-SQ-separable code is that $l(q - 1) \ge \eta_1$. If $l = 1$, this condition simplifies to
$q - 1 \ge \eta_1$, which is the same as the necessary condition for the existence of a $[q; Q; \boldsymbol{\eta}; (1:d); e]$-SQ-disjunct
code. This is expected, since any SQ-disjunct code is also a SQ-separable code (the
converse is not true). On the other hand, if $q = 2$, the condition simplifies to $l \ge \eta_1$. This implies
that if the number of defectives is smaller than $\eta_1$, one cannot identify the defectives using a
binary code.

IV. Code Construction for SQGT 

In this section, we discuss both probabilistic and explicit constructions of SQ-disjunct and 
SQ-separable codes. For each of these families, we propose constructions for a SQGT model 
with arbitrary thresholds, $\boldsymbol{\eta}$. While such constructions are applicable to any set of thresholds, one
may be able to construct codes with higher rates^4 designed specifically for a certain application
based on the unique properties of that application. In other words, due to the vast generality 
of the SQGT model, a construction that works best for a specific set of thresholds may not 
work for a different set of thresholds. For example, QGT (i.e. adder model) is a special case 
of SQGT; while there are many interesting code constructions for QGT, these constructions 
do not generalize for CGT, another special case of SQGT. As a result, after introducing some 
general constructions, we focus on one of the most important special cases of SQGT: SQGT 
with equidistant thresholds. 

A. Construction of SQ-disjunct codes 

SQ-disjunct codes represent generalizations of conventional binary disjunct codes. As a result, 
it is expected that one can construct SQ-disjunct codes using conventional disjunct codes. The 
following proposition describes one such construction. 

4 The rate of a SQGT code is denoted by $R$ and is defined in a standard manner, $R = \frac{\log n}{m}$.

Proposition 1 (Construction 1): Any code generated by multiplying a conventional binary
$d$-disjunct code (capable of correcting $e$ errors^5) by $q - 1$, where $q - 1 \ge \eta_1$, is a
$[q; Q; \boldsymbol{\eta}; (1:d); e]$-SQ-disjunct code. As a result, the rate of the best $[q; Q; \boldsymbol{\eta}; (1:d); e]$-SQ-disjunct code is at least
as large as the rate of the best binary $d$-disjunct code with the same size and length.
Proof: The proof follows easily from the definition of SQ-disjunct codes and conventional 
disjunct codes. ■ 
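
Construction 1 can be tried out on a toy example. In the sketch below, the helper names (`sq_sum`, `is_sq_disjunct`, `scale`), the binary 2-disjunct code (a 4 x 4 identity matrix), and the parameters $q = 4$, $\boldsymbol{\eta} = [0, 3, 5, 7]^T$ are all assumptions. The check uses the characterization of Theorem 2 for $e = 0$ and relies on the fact that including more codewords in a set can only increase its syndrome, so testing sets of exactly $d$ other codewords suffices.

```python
from bisect import bisect_right
from itertools import combinations


def sq_sum(codewords, thresholds):
    """Coordinate k of the result is the r with eta_r <= sum_j x_j(k) < eta_{r+1}."""
    sums = [sum(column) for column in zip(*codewords)]
    if max(sums) >= thresholds[-1]:
        raise ValueError("a coordinate sum reaches eta_Q")
    return tuple(bisect_right(thresholds, s) - 1 for s in sums)


def is_sq_disjunct(columns, thresholds, d):
    """Theorem 2 check (e = 0): no codeword is included in a set of d other codewords.

    Assumes at least two codewords.
    """
    n = len(columns)
    size = min(d, n - 1)
    for i in range(n):
        single = sq_sum([columns[i]], thresholds)
        others = [j for j in range(n) if j != i]
        for subset in combinations(others, size):
            rest = sq_sum([columns[j] for j in subset], thresholds)
            if all(a <= b for a, b in zip(single, rest)):
                return False
    return True


def scale(columns, factor):
    """Multiply every entry of a binary code by `factor`."""
    return [tuple(factor * bit for bit in col) for col in columns]


binary_code = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
q, d = 4, 2
thresholds = [0, 3, 5, 7]              # eta_1 = 3 <= q - 1 and eta_Q = 7 > (q - 1) d

print(is_sq_disjunct(scale(binary_code, q - 1), thresholds, d))  # scaled by q - 1
print(is_sq_disjunct(binary_code, thresholds, d))                # unscaled: entries below eta_1
```

The second call illustrates the necessary condition of Remark 3: with entries smaller than $\eta_1$, every single-codeword syndrome is all-zero and is therefore included in any other syndrome.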
Next, we focus on SQGT with equidistant thresholds, i.e. codes for which $\eta_r = r\eta$, where
$r \in [Q + 1]$. The following lemma can assist in simplifying the construction of SQ-disjunct codes
for equidistant thresholds.

Lemma 1: Any $[q; Q; \eta; (1:d); e]$-SQ-disjunct code $C \in [q]^{m \times n}$ can be transformed into a $[q; Q; \eta; (1:d); e]$-SQ-disjunct code $C' \in \{0, \eta, 2\eta, \ldots, I\eta\}^{m \times n}$, where $I = \lfloor \frac{q-1}{\eta} \rfloor$. In other words, a $[q; Q; \eta; (1:d); e]$-SQ-disjunct code with optimal rate effectively uses only an (I + 1)-ary alphabet, $\{0, \eta, 2\eta, \ldots, I\eta\}$.

Proof: Form C' by the following substitution: $\forall i \in [m]$ and $\forall j \in [n]$, let $C'(i,j) = \lfloor \frac{C(i,j)}{\eta} \rfloor \eta \in \{0, \eta, 2\eta, \ldots, I\eta\}$. Consider a set of d + 1 column-indices $\mathcal{S}$ and fix any column-index $l \in \mathcal{S}$. If $C(i,l)$, $i \in [m]$, is a unique coordinate of the l-th column of C for which (2) is satisfied for $\mathcal{S}$, then (2) will still be satisfied in C' for l and $\mathcal{S}$. The reason is that after the substitution, the i-th coordinate of the syndrome of the l-th column remains unchanged, while the i-th coordinate of the syndrome of the other d codewords indexed by $\mathcal{S} \setminus \{l\}$ does not increase. Since this is true for any $\mathcal{S} \subseteq [n]$ with $|\mathcal{S}| = d + 1$ and for any $l \in \mathcal{S}$, C' is still a $[q; Q; \eta; (1:d); e]$-SQ-disjunct code. On the other hand, if for $i \in [m]$, none of the columns of C indexed by $\mathcal{S}$ has a unique coordinate in the i-th row, then this substitution may generate a unique coordinate in a column and therefore improve the error correcting capability of the code. ■
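The substitution used in the proof of Lemma 1 can be sketched numerically. The example assumes equidistant thresholds, models a test outcome as the floor of the column sum divided by $\eta$, and assumes Q is large enough that outcomes never saturate; the matrix and the helper names are illustrative only.

```python
import numpy as np


def round_down_to_multiples(C, eta):
    """Replace every entry by the largest multiple of eta not exceeding it."""
    return (C // eta) * eta


def sq_outcome(C, subset, eta):
    """Equidistant SQGT outcome (no saturation): floor(column sum / eta)."""
    return C[:, subset].sum(axis=1) // eta


C = np.array([[5, 2, 0],
              [3, 4, 1],
              [0, 7, 6]])
eta = 3
C_prime = round_down_to_multiples(C, eta)
```

For every single column the outcome vector is identical before and after the substitution, while the outcome of any larger set of columns can only stay the same or decrease, which is the property the proof relies on.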
We use this lemma to describe a probabilistic construction of SQ-disjunct codes with equidis- 
tant thresholds. 

Theorem 3 (Construction 2): Form a matrix $C \in \{0, \eta, 2\eta, \ldots, I\eta\}^{m \times n}$ by choosing each entry independently according to the following probability distribution,

$$P_X(x) = \begin{cases} P_0, & \text{if } x = 0, \\ P_1, & \text{if } x \in \{\eta, 2\eta, \ldots, I\eta\}, \end{cases}$$

where $I = \lfloor \frac{q-1}{\eta} \rfloor$, $P_0 = \frac{d}{d+1}$ and $P_1 = \frac{1}{I(d+1)}$. Then C is a $[q; Q; \eta; (1:d); e]$-SQ-disjunct code with probability at least 1 - o(1). The rate of this code, $R_I = \frac{\log n}{m}$, equals

$$R_I = R_1 \left( 1 + \frac{1}{I^{d+1} d^d} \sum_{k=0}^{d-1} \binom{d}{k} \binom{I}{d-k+1} (Id)^k \right),$$

where $R_1$ is the rate of a code constructed by multiplying the best (i.e., highest rate) binary d-disjunct code capable of correcting up to e errors by $\eta$.

5 For constructions of binary d-disjunct codes with error correcting capabilities see for example [7], [38], and [39].

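Proof: Fix a choice of d + 1 column indices, $\mathcal{S} \subseteq [n]$, and among them choose one index, $l \in \mathcal{S}$. There are $\binom{n}{d+1}(d+1)$ ways to choose $\mathcal{S}$ and l. Let $\pi_I$ be the probability of "success" of a row, i.e., the probability that for a row of C denoted by r, one has $\lfloor \frac{r(l)}{\eta} \rfloor > \lfloor \frac{\sum_{i \in \mathcal{S} \setminus \{l\}} r(i)}{\eta} \rfloor$. Due to the fact that the symbol alphabet consists of integer multiples of $\eta$, the aforementioned condition is equivalent to

$$r(l) > \sum_{i \in \mathcal{S} \setminus \{l\}} r(i). \qquad (3)$$

Let $\mathcal{E}_\beta$ be the event that (3) is satisfied and that $r(l) = \beta\eta$. From this definition, and the law of total probability, it follows that

$$\pi_I = \Pr\left( \bigcup_{\beta=1}^{I} \mathcal{E}_\beta \right) = \sum_{\beta=1}^{I} \Pr(\mathcal{E}_\beta). \qquad (4)$$

On the other hand, one has

$$\Pr(\mathcal{E}_\beta) = P_1 \sum_{k=0}^{d} \binom{d}{k} P_0^k P_1^{d-k} \sum_{i=d-k}^{\beta-1} \binom{i-1}{d-k-1},$$

where $\binom{i-1}{d-k-1}$ counts the number of compositions of i with d - k parts (or equivalently the number of positive integer solutions to $\sum_{j=1}^{d-k} x_j = i$), and the inner sum is taken to be 1 for k = d. Since

$$\sum_{i=d-k}^{\beta-1} \binom{i-1}{d-k-1} = \binom{\beta-1}{d-k},$$

equation (4) simplifies to

$$\pi_I = P_1 \sum_{k=0}^{d} \binom{d}{k} \binom{I}{d-k+1} P_0^k P_1^{d-k}. \qquad (5)$$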
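As a sanity check on the row success probability, the sketch below compares the closed form $\pi_I = P_1 \sum_{k=0}^{d} \binom{d}{k}\binom{I}{d-k+1} P_0^k P_1^{d-k}$ (our reading of the expression derived above, with $P_1 = (1 - P_0)/I$) against exhaustive enumeration of all rows. Entries are expressed in units of $\eta$; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def success_prob_closed_form(I, d, P0):
    """Closed form for the probability that r(l) exceeds the sum of d other entries."""
    P1 = (1 - P0) / I
    return P1 * sum(comb(d, k) * comb(I, d - k + 1) * P0**k * P1**(d - k)
                    for k in range(d + 1))


def success_prob_brute_force(I, d, P0):
    """Enumerate every row (x_l, x_1, ..., x_d) with entries in {0, ..., I}."""
    P1 = (1 - P0) / I
    total = Fraction(0)
    for row in product(range(I + 1), repeat=d + 1):
        if row[0] > sum(row[1:]):          # the success condition on the row
            p = Fraction(1)
            for x in row:
                p *= P0 if x == 0 else P1
            total += p
    return total
```

Using `Fraction` keeps the comparison exact, so any mismatch would point to an error in the formula rather than to rounding.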

Consequently, using the union bound, we can derive an upper bound on the probability that C is not a $[q; Q; \eta; (1:d); 0]$-SQ-disjunct code,

$$P_F \leq \binom{n}{d+1}(d+1)(1 - \pi_I)^m \leq \binom{n}{d+1}(d+1)\exp(-m\pi_I) \leq \exp\left( (d+1)\log n - d\log(d+1) + d + 1 - m\pi_I \right).$$

As a result, for any $\delta > 0$, one has $P_F = o(1)$ if

$$m = \left( \frac{d+1}{\pi_I} + \delta \right) \log n.$$

This result can be generalized for $[q; Q; \eta; (1:d); e]$-SQ-disjunct codes, where e (possibly) grows with n. Using the Chernoff bound, the probability that, for a fixed $\mathcal{S}$ and l, at most 2e rows satisfy (3) is upper bounded by

$$\exp\left( -\frac{m\pi_I}{2}\left( 1 - \frac{2e}{m\pi_I} \right)^2 \right).$$

As a result, the probability that C is not a $[q; Q; \eta; (1:d); e]$-SQ-disjunct code is upper bounded by

$$P_F \leq \exp\left( (d+1)\log n + d + 1 - d\log(d+1) - \frac{m\pi_I}{2}\left( 1 - \frac{2e}{m\pi_I} \right)^2 \right).$$

It can be easily seen that for any $\delta > 0$, $P_F = o(1)$ if

$$m = \left( \frac{2(d+1)}{\pi_I} + \delta \right) \log n + \frac{4e}{\pi_I}.$$

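We can compare the rate of a code constructed using this method, denoted by $R_I$, with that of a code constructed by multiplying a d-disjunct code with $\eta$, denoted by $R_1$. The optimal distribution for the latter case can be shown to be equal to $P_0 = \frac{d}{d+1}$ (a known result for classical disjunct codes), and therefore $\pi_1 = \frac{d^d}{(d+1)^{d+1}}$ is the maximum probability of "success" of a row^6. As a result, asymptotically one has

$$\frac{R_I}{R_1} = \frac{m_1}{m_I} = \frac{\pi_I}{\pi_1}. \qquad (6)$$

On the other hand,

$$\pi_I = \pi_1 + \gamma_I,$$

where $\gamma_I = \frac{1}{I^{d+1}(d+1)^{d+1}} \sum_{k=0}^{d-1} \binom{d}{k} \binom{I}{d-k+1} (Id)^k$. Consequently,

$$\frac{R_I}{R_1} = 1 + \frac{\gamma_I}{\pi_1} = 1 + \frac{1}{I^{d+1} d^d} \sum_{k=0}^{d-1} \binom{d}{k} \binom{I}{d-k+1} (Id)^k.$$
■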

Fig. 4 shows the asymptotic improvement in the rate as a function of I for different values of d.

Remark 5: It is worth mentioning that instead of setting $P_0 = \frac{d}{d+1}$ one can consider $P_0$ to be a parameter that may be optimized for the rate of the code. Making this change does not affect the validity of eqs. (5)-(6), but it may increase the ratio $\frac{R_I}{R_1}$. Although finding a simple closed-form expression for the maximum of $\pi_I$ over $P_0$ is not possible, we evaluated (5) numerically to find the maximum probability of "success" of a row. The resulting improvement in the rate is shown in Fig. 5 as a function of I, for different values of d.
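The numerical optimization mentioned in Remark 5 can be imitated with a simple grid search. This is only a sketch: the paper does not state which numerical method was used, the grid resolution is an arbitrary assumption, and `pi_I` implements our reading of the closed form for the row success probability with $P_1 = (1 - P_0)/I$.

```python
from math import comb


def pi_I(I, d, P0):
    """Row success probability as a function of the free parameter P0."""
    P1 = (1 - P0) / I
    return P1 * sum(comb(d, k) * comb(I, d - k + 1) * P0**k * P1**(d - k)
                    for k in range(d + 1))


def best_P0(I, d, grid=2000):
    """Grid search for the P0 in (0, 1) that maximizes pi_I."""
    candidates = [k / grid for k in range(1, grid)]
    return max(candidates, key=lambda P0: pi_I(I, d, P0))
```

For I = 1 the search returns a value close to d/(d+1), in agreement with the classical optimum; for I > 1 the maximizer may differ, which is the effect the remark points to.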

6 Note that even though $\pi_1$ is the optimal probability of success of a row when $q - 1 < 2\eta$, the same statement does not necessarily hold for $\pi_I$ found in this construction.

Fig. 4: Improvement in the rate of a SQ-disjunct code constructed in Construction 2 with a simple choice of $P_0$.

As discussed earlier, SQ-disjunct codes are endowed with a simple decoding algorithm of complexity O(mn). The next theorem describes an explicit construction of a code that uses SQ-disjunct codes as building blocks; even though this code is not SQ-disjunct, but only SQ-separable, it iteratively uses a decoder for SQ-disjunct codes and maintains a decoding complexity of O(mn).

Theorem 4 (Construction 3): Fix a binary d-disjunct code matrix $C_b$ of dimensions $m_b \times n_b$ capable of correcting up to e errors. Let $K = \log_d\left( \lfloor \frac{q-1}{\eta} \rfloor (d-1) + 1 \right)$; construct a code of length $m = m_b$ and size $n = K n_b$, by concatenating K matrices, $C = [C_1, C_2, \ldots, C_K]$, where $C_j = \left( \sum_{i=0}^{j-1} \eta d^i \right) C_b$, $1 \leq j \leq K$. The constructed code is a $[q; Q; \eta; (1:d); e]$-SQ-separable code with decoding complexity of O(mn).

Fig. 5: Improvement in the rate of a SQ-disjunct code constructed based on Construction 2 with the optimum choice of $P_0$. The parameter u, as before, denotes a known upper bound on the number of defectives.

Proof: The proof is based on exhibiting a decoding procedure and showing that the procedure allows for distinguishing between any two different sets of not more than d defectives. The decoder is described below.

Let y be the Q-ary vector of test outcomes, or equivalently, the syndrome of the defectives. For a rational vector z, let $\lfloor z \rfloor$ and $\langle z \rangle$ denote the vector of integer parts of z and fractional parts of z, respectively. If d = 1, decoding reduces to finding the column of C equal to $\eta y$. If d > 1, decoding proceeds as follows.

Step 1: Set $y'_K = y$ and form vectors $y_j$, $1 \leq j \leq K$, using the rules:

$$y_j = \frac{d^j - 1}{d - 1} \left\lfloor \frac{d - 1}{d^j - 1}\, y'_j \right\rfloor$$

and

$$y'_{j-1} = \frac{d^j - 1}{d - 1} \left\langle \frac{d - 1}{d^j - 1}\, y'_j \right\rangle.$$

Step 2: Identify the defectives as follows: if the syndrome of a column of $C_j$ is included in $y_j$, declare the subject corresponding to that column defective. Declare the subject negative otherwise.

The result is obviously true for d = 1. Therefore, we focus on the case d > 1. First, using induction, one can prove that each $y_j$, $1 \leq j \leq K$, is the syndrome of a subset of columns of $C_j$ corresponding to defectives. Let $C'_j = [C_1, C_2, \ldots, C_j]$, where $1 \leq j \leq K$. Since the non-zero entries of C are multiples of $\eta$, $\eta y$ is the sum of columns of C corresponding to a subset of defectives. Also, the maximum value of the entries of $C'_{K-1}$ equals $\eta \frac{d^{K-1} - 1}{d - 1}$. Since there are at most d defectives, the maximum value of their sum does not exceed $\eta \frac{d(d^{K-1} - 1)}{d - 1}$. This bound is strictly smaller than $\eta \frac{d^K - 1}{d - 1}$, the minimum non-zero entry of $C_K$. As a result, $y_K$ is the syndrome of the defectives with codewords in $C_K$, and $y'_{K-1}$ is the syndrome of defectives with codewords in $C'_{K-1}$. Similarly, it can be shown that $\forall j$, $1 \leq j \leq K - 1$, $y_j$ is the syndrome of the defectives with codewords in $C_j$, and $y'_{j-1}$ is the syndrome of the defectives with codewords in $C'_{j-1}$.

From Theorem 1 we know that each $C_j$ is a $[q; Q; \eta; (1:d); e]$-SQ-disjunct code. Consequently, using step 2, one can uniquely identify the defectives with codewords from $C_j$. ■
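The peeling decoder used in this proof can be sketched as follows. The sketch assumes d > 1, the error-free case, equidistant thresholds with outcomes equal to the floor of the column sum divided by $\eta$ (no saturation), and block multipliers $\eta \frac{d^j - 1}{d - 1}$, which is our reading of the construction. The identity matrix serves as a toy d-disjunct building block, and all function names are hypothetical.

```python
import numpy as np


def construction3(Cb, d, K, eta):
    """Concatenate K scaled copies of Cb; block j is multiplied by eta*(d^j-1)/(d-1)."""
    coeffs = [eta * (d**j - 1) // (d - 1) for j in range(1, K + 1)]
    return np.hstack([a * Cb for a in coeffs]), coeffs


def sq_outcomes(C, defectives, eta):
    """Test outcomes for a set of defective column indices."""
    idx = np.array(sorted(defectives), dtype=int)
    return C[:, idx].sum(axis=1) // eta


def decode(y, Cb, d, K):
    """Peel off the blocks from j = K down to j = 1 (Step 1), then apply
    the disjunct-code inclusion test inside each block (Step 2)."""
    nb = Cb.shape[1]
    found = set()
    residual = y.copy()
    for j in range(K, 0, -1):
        a = (d**j - 1) // (d - 1)
        yj = residual // a            # contribution of block j, in units of a
        residual = residual % a       # what remains for blocks 1..j-1
        for t in range(nb):
            support = Cb[:, t] == 1
            if np.all(yj[support] >= 1):
                found.add((j - 1) * nb + t)
    return found


Cb = np.eye(4, dtype=int)             # toy d-disjunct building block
d, K, eta = 2, 3, 2
C, coeffs = construction3(Cb, d, K, eta)
```

Integer division and remainder play the roles of the integer-part and fractional-part operations in Step 1; the separation works because d times the largest multiplier of the lower blocks is still smaller than the multiplier of block j.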

B. Construction of SQ-separable codes

Similar to the case of SQ-disjunct codes, SQ-separable codes can also be constructed from their binary separable counterparts.

Proposition 2 (Construction 4): Any code generated by multiplying a conventional binary d-separable code (capable of correcting up to e errors) by q - 1, where $q - 1 \geq \eta_1$, represents a $[q; Q; \eta; (1:d); e]$-SQ-separable code.

Proof: The proof follows easily from the definition of SQ-separable codes and conventional separable codes. ■

Although this proposition describes how to construct a SQ-separable code with alphabet size $q \geq \eta_1 + 1$, it does not address the issue of constructing SQGT codes with alphabet size $q \leq \eta_1$. This problem may be solved by noticing that SQGT can be viewed as a generalization of threshold group testing (TGT) with zero gap. While in TGT with zero gap there exists only one threshold, in SQGT one may have more than one threshold (Q-ary test results). This implies that any code constructed for TGT is also a SQ-separable code. In [40], Chen and Fu observed that a variation of binary disjunct codes, also studied under the name of cover-free families (for example see [41]-[43]), can be used for TGT. In [44], Cheraghchi showed that a weaker notion of disjunct codes, so called threshold disjunct codes, is also applicable to TGT and provided constructions with better rates. In the following theorem, we describe a generalization of these codes that is particularly useful for the SQGT model. This generalization provides binary (and non-binary) codes for arbitrary thresholds, $\eta$.

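Theorem 5: Let $\eta_\alpha$ be the $\alpha$-th threshold in a SQGT model. Consider a matrix $C \in [q]^{m \times n}$ such that for any subset of column-indices $\mathcal{S} \subseteq [n]$ with $\lceil \frac{\eta_\alpha}{q-1} \rceil \leq |\mathcal{S}| \leq d$, and for any index $l \in \mathcal{S}$, and for any $\mathcal{N} \subseteq [n]$ where $|\mathcal{N}| \leq |\mathcal{S}|$ and $\mathcal{S} \cap \mathcal{N} = \emptyset$, there exists a set of row-indices $\mathcal{R}$ with size at least 2e + 1 such that $\forall j \in \mathcal{R}$:

$$\sum_{k \in \mathcal{S}} C(j,k) \in \{\eta_1, \eta_2, \ldots, \eta_\alpha\}, \qquad (7)$$
$$\sum_{k \in \mathcal{N}} C(j,k) = 0, \qquad (8)$$
$$C(j,l) \neq 0. \qquad (9)$$

Then, C is a $[q; Q; \eta; (\lceil \frac{\eta_\alpha}{q-1} \rceil : d); e]$-SQ-separable code.

Proof: Consider two distinct sets of codewords (i.e. columns of C) denoted by X and Z such that $\lceil \frac{\eta_\alpha}{q-1} \rceil \leq |X|, |Z| \leq d$. Without loss of generality assume that $|X| \geq |Z|$. Let $\mathcal{S}$ be the set of column-indices corresponding to X. Also, let $\mathcal{N}$ be the set of column-indices corresponding to $Z \setminus X$. Consequently, $\lceil \frac{\eta_\alpha}{q-1} \rceil \leq |\mathcal{S}| \leq d$, $|\mathcal{N}| \leq |\mathcal{S}|$, and $\mathcal{S} \cap \mathcal{N} = \emptyset$. Let l be the index of a codeword in $X \setminus Z$ (this codeword always exists due to the manner in which X and Z are chosen).

From the definition of C, there exists a set of row-indices $\mathcal{R}$ with size $|\mathcal{R}| \geq 2e + 1$ such that $\forall k \in \mathcal{R}$, conditions (7)-(9) are satisfied. This implies that $\forall k \in \mathcal{R}$,

$$y_X(k) > y_Z(k).$$

As a result, C is a $[q; Q; \eta; (\lceil \frac{\eta_\alpha}{q-1} \rceil : d); e]$-SQ-separable code. ■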
The next theorem describes a probabilistic construction for this type of SQ-separable codes with q = 2. This construction can be generalized for q > 2 in a similar manner.

Theorem 6 (Construction 5): Let $r = \lfloor \log_2 \frac{d}{\eta_\alpha} \rfloor + 1$, $\mu = \frac{1}{8}\left( 1 - \frac{1}{\eta_\alpha} \right)$, and $p = \frac{1}{2} \sum_{\beta=1}^{\alpha} \left( \frac{\mu}{\eta_\beta - 1} \right)^{\eta_\beta - 1} \frac{\mu}{d-1}$. Assume that d = o(n). For any $i \in [r]$, form a binary matrix $C_i \in [2]^{(m/r) \times n}$ by choosing each entry independently according to a Bernoulli distribution such that the probability of choosing 1 equals $P_i = \frac{1}{2^{i+2}\eta_\alpha}$. Now, form a matrix $C = [C_1^T, C_2^T, \ldots, C_r^T]^T$, where T denotes the matrix transpose operator. Then C is a $[2; Q; \eta; (\eta_\alpha : d); 0]$-SQ-separable code with probability at least 1 - o(1), provided that $m = r\left( \frac{2d}{p} + \delta \right)\log n$, $\forall \delta > 0$. Similarly, C is a $[2; Q; \eta; (\eta_\alpha : d); e]$-SQ-separable code with probability at least 1 - o(1) if $m = r\left[ \left( \frac{4d}{p} + \delta \right)\log n + \frac{4e}{p} \right]$, $\forall \delta > 0$.
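The layered random matrix of Construction 5 can be generated as in the sketch below. It assumes $r = \lfloor \log_2 (d/\eta_\alpha) \rfloor + 1$ blocks and Bernoulli parameters $P_i = 1/(2^{i+2}\eta_\alpha)$ for i = 1, ..., r, which is our reading of the statement; the number of rows per block is left as a free input rather than derived from the bound, and the function name is hypothetical.

```python
from math import floor, log2

import numpy as np


def construction5(n, d, eta_alpha, rows_per_block, seed=0):
    """Stack r random binary blocks; block i uses density 1 / (2^(i+2) * eta_alpha)."""
    rng = np.random.default_rng(seed)
    r = floor(log2(d / eta_alpha)) + 1
    blocks = []
    for i in range(1, r + 1):
        P_i = 1.0 / (2 ** (i + 2) * eta_alpha)
        blocks.append((rng.random((rows_per_block, n)) < P_i).astype(int))
    return np.vstack(blocks), r


C, r = construction5(n=50, d=8, eta_alpha=2, rows_per_block=10)
```

Each block is tuned to defective sets whose size lies in one dyadic range, so sparser blocks (larger i) serve larger sets.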

Proof: The idea behind this construction is that each sub-matrix $C_i$, $i \in [r]$, satisfies conditions (7)-(9) for different sizes of $\mathcal{S}$.

From Theorem 5, we know that for q = 2 it is only required to consider $\mathcal{S}$ with size $\eta_\alpha \leq |\mathcal{S}| \leq d$; therefore, for any such choice of $\mathcal{S}$ we can find $i \in [r]$ such that $\eta_\alpha 2^{i-1} \leq |\mathcal{S}| \leq 2^i \eta_\alpha$. Fix a choice of $\mathcal{S}$, a choice of $l \in \mathcal{S}$, and a choice of $\mathcal{N}$ such that $|\mathcal{N}| \leq |\mathcal{S}|$. Let $A_i$ denote the total number of such choices. Form $C_i$ by choosing each entry independently according to a Bernoulli distribution such that the probability of choosing 1 equals $P_i = \frac{1}{2^{i+2}\eta_\alpha}$. Let $\pi_i$ denote the probability that a fixed row of $C_i$ denoted by r satisfies conditions (7)-(9). Note that since the entries of $C_i$ are chosen according to an independent, identically distributed (i.i.d.) probability distribution, the choice of r does not affect $\pi_i$. Let $\mathcal{E}_\beta$, $\beta \in [\alpha]$, be the event that $\sum_{k \in \mathcal{S}} r(k) = \eta_\beta$, and $\sum_{k \in \mathcal{N}} r(k) = 0$, and $r(l) = 1$. Consequently,

$$\pi_i = \Pr\left( \bigcup_{\beta=1}^{\alpha} \mathcal{E}_\beta \right) = \sum_{\beta=1}^{\alpha} \Pr(\mathcal{E}_\beta),$$

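where the second equality follows from the disjointness of these events. A lower bound on the probability of the event $\mathcal{E}_\beta$ can be found using

$$\Pr(\mathcal{E}_\beta) = \sum_{\substack{T \subseteq \mathcal{S} \setminus \{l\}, \\ |T| = \eta_\beta - 1}} \Pr(r(k) = 1, \forall k \in T) \cdot \Pr(r(l) = 1) \cdot \Pr\left( r(k) = 0, \forall k \in (\mathcal{S} \cup \mathcal{N}) \setminus (T \cup \{l\}) \right)$$
$$= \sum_{T} P_i^{\eta_\beta - 1} \cdot P_i \cdot (1 - P_i)^{|\mathcal{S}| + |\mathcal{N}| - \eta_\beta} \geq \sum_{T} P_i^{\eta_\beta} \left( 1 - (|\mathcal{S}| + |\mathcal{N}| - \eta_\beta) P_i \right).$$

On the other hand,

$$P_i(|\mathcal{S}| + |\mathcal{N}| - \eta_\beta) \leq P_i(2|\mathcal{S}| - \eta_\beta) = 2P_i\left( |\mathcal{S}| - \frac{\eta_\beta}{2} \right) = \frac{1}{2^{i+1}\eta_\alpha}\left( |\mathcal{S}| - \frac{\eta_\beta}{2} \right) \leq \frac{2^i \eta_\alpha - \eta_\beta/2}{2^{i+1}\eta_\alpha} < \frac{1}{2}.$$

As a result,

$$\Pr(\mathcal{E}_\beta) \geq \frac{1}{2}\binom{|\mathcal{S}| - 1}{\eta_\beta - 1} P_i^{\eta_\beta} \geq \frac{1}{2}\left( \frac{|\mathcal{S}| - 1}{\eta_\beta - 1} \right)^{\eta_\beta - 1} P_i^{\eta_\beta} = \frac{1}{2}\, \frac{\left( P_i(|\mathcal{S}| - 1) \right)^{\eta_\beta}}{(\eta_\beta - 1)^{\eta_\beta - 1}(|\mathcal{S}| - 1)}$$
$$\geq \frac{1}{2}\, \frac{\left( 2^{-3} - 2^{-i-2}/\eta_\alpha \right)^{\eta_\beta}}{(\eta_\beta - 1)^{\eta_\beta - 1}(|\mathcal{S}| - 1)} \geq \frac{1}{2}\left( \frac{\mu}{\eta_\beta - 1} \right)^{\eta_\beta - 1} \frac{\mu}{|\mathcal{S}| - 1} \geq \frac{1}{2}\left( \frac{\mu}{\eta_\beta - 1} \right)^{\eta_\beta - 1} \frac{\mu}{d - 1},$$

where $\mu = \frac{1}{8}\left( 1 - \frac{1}{\eta_\alpha} \right)$. Consequently, a lower bound on $\pi_i$ reads as

$$\pi_i \geq \sum_{\beta=1}^{\alpha} \frac{1}{2}\left( \frac{\mu}{\eta_\beta - 1} \right)^{\eta_\beta - 1} \frac{\mu}{d - 1} = p, \qquad (10)$$

which is independent of i.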



Using a union bound and (10), we arrive at an upper bound on the probability that C does not satisfy the conditions in Theorem 5, i.e.

$$P_F \leq \sum_{i=1}^{r} A_i P_{F,i}(\pi_i). \qquad (11)$$

Here, $P_{F,i}(\pi_i)$ is the probability that $C_i$ does not satisfy the conditions in Definition 5 for a choice of $\mathcal{S}$ that satisfies $\eta_\alpha 2^{i-1} \leq |\mathcal{S}| \leq 2^i \eta_\alpha$. Let m' denote the number of rows of $C_i$, for all $i \in [r]$.

If e = 0, then

$$P_{F,i}(\pi_i) = (1 - \pi_i)^{m'} \leq (1 - p)^{m'} \leq \exp(-m'p) = p_F(p); \qquad (12)$$

otherwise, for e > 0 we can use the Chernoff bound to find

$$P_{F,i}(\pi_i) \leq \exp\left( -\frac{m'p}{2}\left( 1 - \frac{2e}{m'p} \right)^2 \right) = p_F(p). \qquad (13)$$

Since these upper bounds are independent of i, (11) simplifies to

$$P_F \leq A_\alpha\, p_F(p), \qquad (14)$$

where $A_\alpha = \sum_{i=1}^{r} A_i$ and $p_F(p)$ is defined in (12) and (13) for e = 0 and e > 0, respectively.
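Since $A_\alpha$ is equal to the total number of choices for $\mathcal{S}$, l, and $\mathcal{N}$, one has

$$A_\alpha = \sum_{s=\eta_\alpha}^{d} \binom{n}{s}\, s \sum_{z=0}^{\min(s, n-s)} \binom{n-s}{z},$$

where s denotes the size of $\mathcal{S}$ and z denotes the size of $\mathcal{N}$. Since $\binom{n-s}{z} \leq \binom{n}{s}$ for any $z \in \{0, 1, \ldots, \min(s, n-s)\}$, and assuming that $d \leq \frac{n}{2}$ for simplicity,

$$A_\alpha \leq \sum_{s=\eta_\alpha}^{d} \binom{n}{s}^2 s(s+1) \leq \sum_{s=\eta_\alpha}^{d} \left( \frac{ne}{s} \right)^{2s} (s+1)s \leq (d - \eta_\alpha)(d+1)d\left( \frac{ne}{d} \right)^{2d} \leq d^3 \left( \frac{ne}{d} \right)^{2d}, \qquad (15)$$

where e = exp(1) denotes the base of the natural logarithm and is not to be confused with the number of errors e that the code can correct. Note that the third inequality follows from the fact that the largest term in $\sum_{s=\eta_\alpha}^{d} \left( \frac{ne}{s} \right)^{2s}(s+1)s$ is indexed by s = d. This can be easily shown by noting that

$$\frac{\left( \frac{ne}{s} \right)^{2s}(s+1)s}{\left( \frac{ne}{s+1} \right)^{2s+2}(s+1)(s+2)} = \left( \frac{s+1}{s} \right)^{2s} \frac{(s+1)^2 s\, e^{-2}}{n^2 (s+2)} \leq \frac{1}{n^2}\, s(s+1) < 1.$$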

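Using (10), (12), (14), and (15), the probability that C is not a $[2; Q; \eta; (\eta_\alpha : d); 0]$-SQ-separable code of size n and length m = rm' is upper bounded by

$$P_F \leq d^3 \left( \frac{ne}{d} \right)^{2d} \exp(-m'p) = \exp\left( 2d\log n + 3\log d + 2d - 2d\log d - m'p \right).$$

As a result, if d = o(n), for any $\delta > 0$ one has $P_F = o(1)$ if

$$m = rm' = r\left( \frac{2d}{p} + \delta \right)\log n.$$

Similarly, the probability that C is not a $[2; Q; \eta; (\eta_\alpha : d); e]$-SQ-separable code of size n and length m = rm' is upper bounded by

$$P_F \leq d^3 \left( \frac{ne}{d} \right)^{2d} \exp\left( -\frac{m'p}{2}\left( 1 - \frac{2e}{m'p} \right)^2 \right) = \exp\left( 2d\log n + 3\log d + 2d - 2d\log d - \frac{m'p}{2}\left( 1 - \frac{2e}{m'p} \right)^2 \right).$$

Then, if d = o(n), for any $\delta > 0$ one has $P_F = o(1)$ if

$$m = rm' = r\left[ \left( \frac{4d}{p} + \delta \right)\log n + \frac{4e}{p} \right].$$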

Remark 6: As discussed earlier, any code for TGT without a gap such that its threshold is in $\eta_T \in \{\eta_1, \eta_2, \ldots, \eta_\alpha\}$ can be used for SQGT. Hence, the rate of the code constructed in Theorem 6, denoted by $R_{SQ5}$, can be compared with the rate of the threshold disjunct codes in [44], denoted by $R_{TD}$. Note that it can be easily shown that the best rate is achieved if $\eta_T = \eta_1$. Consequently, one has

RSQ, [l°g2 j\ + 1 gM^Tg L1QS2 + 1 

Rtd ~[lo g2 fJ + l (^) m ^l > Llog 2 + 1 ■ 

As an example, if $d = \eta_\alpha = 4\eta_1$, then $\frac{R_{SQ5}}{R_{TD}} \geq 3$.

Next, we describe an explicit construction of the family of codes described in Theorem 5. In [44], an explicit construction of threshold disjunct codes for TGT has been described using strong lossless condensers [45]. In the following construction, we describe an explicit construction of binary SQ-separable codes based on the construction of threshold disjunct codes. In our construction, we use the building blocks described in [44, Construction 3] for TGT and leverage the fact that in SQGT we have Q thresholds to increase the rate.

In [44, Construction 3], the construction of a building block matrix for threshold disjunct codes (henceforth a BBTD matrix) is described. In this construction, a strong lossless $(k, \epsilon)$-condenser $f : \{0,1\}^{\tilde{n}} \times \{0,1\}^t \to \{0,1\}^{\ell}$ is employed to construct an $m' \times n'$ BBTD matrix with parameters $n' = 2^{\tilde{n}}$ and $m' = 2^{t+k} O_{\eta_T}\left( 2^{\eta_T(\ell - k)} \right)$, where $\eta_T$ is the threshold in the TGT model, and k and $\epsilon$ denote the entropy and the error in the definition of a lossless condenser, respectively. In this construction, $\epsilon < (1 - p)/16$ for some real parameter $0 \leq p < 1$. Let $\gamma := \max\{1, 2^{k-\ell}\, 2^k/(10\eta_T)\}$. The following lemma has been proven in [44].

Lemma 2: In a BBTD matrix B with parameters described above, and for any subset of column-indices $\mathcal{S} \subseteq [n]$ with $2^{k-2} \leq |\mathcal{S}| \leq 2^{k-1}$, and for any $\mathcal{N} \subseteq [n]$ where $|\mathcal{N}| \leq |\mathcal{S}|$ and $\mathcal{S} \cap \mathcal{N} = \emptyset$, there exists a set of row-indices $\mathcal{R}$ with size at least $p\gamma 2^t$ such that $\forall j \in \mathcal{R}$

$$\sum_{k \in \mathcal{S}} B(j,k) = \eta_T, \qquad (16)$$
$$\sum_{k \in \mathcal{N}} B(j,k) = 0. \qquad (17)$$

These BBTD matrices are then used in [44] to construct the so-called "regular" matrices, which are then used to construct threshold disjunct codes. In the next theorem, we use these BBTD matrices to construct SQ-separable codes with rates better than any such threshold disjunct codes with $\eta_T \in \{\eta_1, \eta_2, \ldots, \eta_\alpha\}$.

Theorem 7 (Construction 6): Assume that d > rj a > rji > 1. Let 77^ = 2^°^ r i a ' 1 ^ be the smallest 
power of 2 that is at least as large as (i] a - 1), let r = [log 2 ({d - l)/ri' a )~\, and let p e [0, 1). 
Let B = {Bj}g be a set of binary BBTD matrices constructed for parameter i] T = i] 1 - 1 using 
a family of strong lossless (k i: e)-condensers T = {fi} r , where kj = [log 2 (771 - 1)] +i + 1 and 
e < (1 - p)/16. For each i e [r + 1], /, : {0, 1}™ x {0, 1}* -> {0, and for the corresponding 
BBTD matrix one has B, € [2] miX " where m, L = 2*+** O^h^- 1 ^-^) and n = 2™. In step 
1, Vz e [r + 1] construct e [2] 2 m! * n by repeating Bj, 2 r ~ J times according to the rule 
B^ = [B i T ,B i T r--,B i T ] T - In step 2, form matrix C = [B(, T , B; T , -, B; T f. In step 3, fix a 
d-disjunct binary matrix D € [2] mdX ™ capable of correcting e\ errors in the CGT model. Form 
the binary matrix C such that its k th row is equal to the bit-wise OR of the i th row of C and 
the j th row of D, where % = [^-] and j = k - (i - l)m d . Then C is a [2; Q; 77; (r] a : d); e]-SQ- 
separable of size mxn where m = 2*md(d-l)^j (E[=o O^^^i -1 ^ - *^) j j, e = | ^ 2ei+1 )P 2 5 - 1 j ; 
and 7' = maxjl, 5( ^ x) min ie[r . +1] {2 fc -^}}. 
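Step 3 of Construction 6 combines every row of the intermediate matrix C' with every row of the disjunct matrix D through a bitwise OR. A minimal sketch of that row-pairing rule is given below; the two small matrices are arbitrary placeholders rather than actual BBTD or disjunct matrices, and `or_product` is a hypothetical helper name.

```python
import numpy as np


def or_product(C_prime, D):
    """Row k = (i - 1) * m_d + j (1-based) is the bitwise OR of
    row i of C_prime and row j of D."""
    m_c, n = C_prime.shape
    m_d, n_d = D.shape
    assert n == n_d, "both matrices must have the same number of columns"
    ored = C_prime[:, None, :] | D[None, :, :]   # shape (m_c, m_d, n)
    return ored.reshape(m_c * m_d, n)


C_prime = np.array([[1, 0, 0],
                    [0, 1, 0]])
D = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])
C = or_product(C_prime, D)
```

The reshape keeps the index of C' as the slow index and the index of D as the fast index, which matches the rule $i = \lceil k/m_d \rceil$, $j = k - (i-1)m_d$.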

Proof: Consider a set of column-indices S such that r) a - 1 < |<S| < d - 1; then one has 
77^,/2 < |<S| < 2 r r7^,. Consequently, using Lemma [2] we know that Vi € [r + 1], Bj has at least 



P7i2* rows that satisfy ( fTo} and ( |T7| ) for r] T = ?y 1 - 1 and 2 i ~ 1 r)' a < \S\ < 2 i r]' a , where 7$ 



max{l, 2 ki ~ li 2 ki / (10 (ri 1 - 1))}. Since Vz e [r + 1], one has 

2 r - i max{l,2^^2^/(10r7 1 )} > max j 1, 2 k ^' h ?~ l - \ > max 1 1, ^~ ~~~~ min {2^}1, 

I 5(77! -1)1 I 5(77,-1)^+1] J 



then B ? - has at least ^2*7' rows satisfying ( fT6| ) and pTj ) for 2 t ~ 1 rj' a < \S\ < 2 l i]' a , where 



7' = max i 1, — - — ^— min {2 ki H 1 . 
This implies that C has at least e' = p2 l Y rows such that 

EcakJ^-l (18) 

EC(j;*)=o. (19) 

for any sets $\mathcal{S}$ and $\mathcal{N}$ such that $\eta_\alpha - 1 \leq |\mathcal{S}| \leq d - 1$, $|\mathcal{N}| \leq |\mathcal{S}|$ and $\mathcal{S} \cap \mathcal{N} = \emptyset$.

In order for C to be a $[2; Q; \eta; (\eta_\alpha : d); e]$-SQ-separable code^7, we need to show that for any two distinct sets of codewords (i.e. columns of C) denoted by $X_1$ and $X_2$ for which^8 $\eta_\alpha \leq |X_2| \leq |X_1| \leq d$, one has $y_{X_1} \neq y_{X_2}$. Note that this requirement is weaker than the conditions (7)-(9); however, it is sufficient to have a $[2; Q; \eta; (\eta_\alpha : d); e]$-SQ-separable code. Let $\mathcal{S}_1$ and $\mathcal{S}_2$ be the sets of column-indices corresponding to $X_1$ and $X_2$, respectively. Since $\mathcal{S}_1 \neq \mathcal{S}_2$ and $|\mathcal{S}_1| \geq |\mathcal{S}_2|$, the set $\mathcal{S}_1 \setminus \mathcal{S}_2$ is nonempty. Let $l \in \mathcal{S}_1 \setminus \mathcal{S}_2$. Since $|\mathcal{S}_2| \leq d$, it follows from the definition of binary d-disjunct matrices that for the set $\mathcal{S}_2 \cup \{l\}$, there exists a set of row indices of D denoted by $\mathcal{R}_D$ with size at least $2e_1 + 1$ such that

$$\sum_{k \in \mathcal{S}_2} D(j,k) = 0, \quad \forall j \in \mathcal{R}_D, \qquad (20)$$
$$D(j,l) = 1, \quad \forall j \in \mathcal{R}_D. \qquad (21)$$

Let $\mathcal{S} = \mathcal{S}_1 \setminus \{l\}$. Also, if $\mathcal{S}_1 \cap \mathcal{S}_2 = \emptyset$ and $|\mathcal{S}_1| = |\mathcal{S}_2|$, define $\mathcal{N} = \mathcal{S}_2 \setminus \{k_0\}$ where $k_0$ is an arbitrary column-index of $\mathcal{S}_2$; otherwise define $\mathcal{N} = \mathcal{S}_2 \setminus \mathcal{S}_1$. By these definitions, $|\mathcal{N}| \leq |\mathcal{S}|$. Let $\mathcal{R}_{C'}$ be the set of row-indices of C' for which (18) and (19) are satisfied for the sets $\mathcal{S}$ and $\mathcal{N}$. Consider any $i \in \mathcal{R}_{C'}$ and any $j \in \mathcal{R}_D$. The $(j + (i-1)m_d)$-th row of C is formed by finding the bit-wise OR of the i-th row of C' and the j-th row of D. Consequently,
$$\sum_{k \in \mathcal{S}_1} C(j + (i-1)m_d, k) = \sum_{k \in \mathcal{S}} C(j + (i-1)m_d, k) + C(j + (i-1)m_d, l) = \eta_1 - 1 + 1 = \eta_1, \qquad (22)$$
$$\sum_{k \in \mathcal{S}_2} C(j + (i-1)m_d, k) < \eta_1, \qquad (23)$$

where $C(j + (i-1)m_d, l) = 1$ follows from (21), and (23) follows from the following argument.

First, note that using (17) and (20), one has $\sum_{k \in \mathcal{N}} C(j + (i-1)m_d, k) = 0$. As a result, if $\mathcal{S}_1 \cap \mathcal{S}_2 = \emptyset$ and $|\mathcal{S}_1| = |\mathcal{S}_2|$, then

$$\sum_{k \in \mathcal{S}_2} C(j + (i-1)m_d, k) = \sum_{k \in \mathcal{N}} C(j + (i-1)m_d, k) + C(j + (i-1)m_d, k_0) \leq 1 < \eta_1.$$

Otherwise one has

$$\sum_{k \in \mathcal{S}_2} C(j + (i-1)m_d, k) = \sum_{k \in \mathcal{N}} C(j + (i-1)m_d, k) + \sum_{k \in \mathcal{S}_2 \cap \mathcal{S}_1} C(j + (i-1)m_d, k)$$
$$= \sum_{k \in \mathcal{S}_2 \cap \mathcal{S}_1} C(j + (i-1)m_d, k) \leq \sum_{k \in \mathcal{S}_1 \setminus \{l\}} C(j + (i-1)m_d, k) = \eta_1 - 1 < \eta_1.$$

Since $|\mathcal{R}_{C'}| \geq e'$ and $|\mathcal{R}_D| \geq 2e_1 + 1$, C has a set of row indices $\mathcal{R}$, $|\mathcal{R}| \geq e'(2e_1 + 1)$, for which (22) and (23) are satisfied. This implies that $\forall j \in \mathcal{R}$, $y_{X_1}(j) > y_{X_2}(j)$, and therefore C is a $[2; Q; \eta; (\eta_\alpha : d); e]$-SQ-separable code, where $e = \left\lfloor \frac{(2e_1 + 1)p\gamma' 2^t - 1}{2} \right\rfloor$. Note that C is an $m \times n$ matrix where $n = 2^{\tilde{n}}$ and

7 Although this construction resembles the construction of threshold disjunct codes in [44], one should note that the matrix C' generated in step 2 of Construction 6 is not a regular matrix (i.e. it is not a $(d - 1, e'; \eta_1 - 1)$-regular matrix, nor is it a $(d - 1, e'; \eta_\alpha - 1)$-regular matrix). Consequently, [44, Lemma 6] cannot be directly applied to show that C is a SQ-separable code.

8 Note that without loss of generality, we assumed $|X_2| \leq |X_1|$.

m = m d • ( J]2 r - i m i ] w m d [ ^ 2 r+ * +los ^" 1 - 1 ) +1 O r/i (2 ( "i- 1)( ' V ^ ) 

\i=0 / \i=0 

= 2*"m d (d- 1)^— ^ (^O^OJi-^-**) 

Remark 7: A comparison between the rate of the code described in Construction 6, denoted by $R_{SQ6}$, and the rate of the threshold disjunct code described in [44] for $\eta_T = \eta_1$, denoted by $R_{TD}$, reveals that

Rtd m ~ 1 

While the codes described above may be used for SQGT with an arbitrary set of thresholds, 
it is of interest to consider SQ-separable codes for equidistant SQGT. In this case, SQ-separable 
codes are closely related to separable codes for the additive model (QGT). 

Proposition 3 (Construction 7): Fix a binary d-separable code for an additive model, capable of correcting up to e errors, denoted by $C_b$. Form C by multiplying $C_b$ with $q - 1 \in \{\eta, 2\eta, \ldots\}$. Then C is a $[q; Q; \eta; (1:d); e]$-SQ-separable code.

Proof: The proof follows easily from the definition of SQ-separable codes. ■

An example of codes for the additive model is the construction by Lindström, described in [37, Theorem 8]. In his approach, Lindström used a theorem by Bose and Chowla in additive number theory [46] to construct binary codes for an adder channel. Multiplying this code with $\eta$ results in a $[q; Q; \eta; d; 0]$-SQ-separable code of size n and length $m = \lceil d \log_2 L \rceil$, where L is a power of a prime such that $n \leq L$. A similar idea can be used to further improve the rate of SQ-separable codes for equidistant SQGT. The idea is based on a result, proved in [46], that shows that if L is a power of a prime, there exist L nonzero integers smaller than $L^d$ such that the sums of any d such integers (i.e., the d-sums) are distinct modulo $L^d - 1$.

Theorem 8 (Construction 8): Let L be a power of a prime such that $n \leq L$; also let $q' = \lfloor \frac{q-1}{\eta} \rfloor + 1$. Using the construction in [46], find L non-zero integers with distinct d-sums. Let the q'-ary representation of these integers serve as columns of a code $C_{q'}$. Form the code $C = \eta\, C_{q'}$ of length $m = \lceil d \log_{q'} L \rceil$ and size L. A code obtained by choosing any n columns of C is a $[q; Q; \eta; d; 0]$-SQ-separable code of length m and size n.

Proof: We only need to show that $C_{q'}$ is capable of identifying d defectives in an adder model. Assume that there exist two sets of d codewords $X = \{x_i\}_{i=1}^{d}$ and $Z = \{z_j\}_{j=1}^{d}$ such that $|X \cap Z| < d$, and $\sum_{i=1}^{d} x_i = \sum_{j=1}^{d} z_j$. Consequently, $\forall k \in [m]$, $\sum_{i=1}^{d} x_i(k) = \sum_{j=1}^{d} z_j(k)$. Then,

$$\sum_{k=1}^{m} (q')^{k-1} \left( \sum_{i=1}^{d} x_i(k) \right) = \sum_{k=1}^{m} (q')^{k-1} \left( \sum_{j=1}^{d} z_j(k) \right),$$

which implies that there exist two sets of d integers with the same sum. This contradicts the assumptions behind the construction of $C_{q'}$, and completes the proof. ■
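The argument of this proof can be traced on a toy example. The sketch below does not implement the Bose-Chowla construction; as a stand-in it builds a small set of integers with distinct d-sums (over d distinct elements) by a greedy search, writes them in base q' as columns, and recovers the defectives from the digit-wise sums of an adder model. All names and the parameter choices d = 2, q' = 3 are assumptions made for illustration.

```python
from itertools import combinations


def greedy_distinct_sums(count, d):
    """Greedy stand-in for Bose-Chowla: integers whose d-subset sums are distinct."""
    chosen, x = [], 0
    while len(chosen) < count:
        x += 1
        sums = [sum(c) for c in combinations(chosen + [x], d)]
        if len(sums) == len(set(sums)):
            chosen.append(x)
    return chosen


def to_digits(x, base, length):
    """Base-`base` digits of x, least significant digit first."""
    digits = []
    for _ in range(length):
        digits.append(x % base)
        x //= base
    return digits


def decode(y, base, integers, d):
    """Weight the digit-wise sums by powers of the base to recover the
    integer sum, then look up the unique d-subset with that sum."""
    total = sum(base**k * yk for k, yk in enumerate(y))
    for subset in combinations(range(len(integers)), d):
        if sum(integers[i] for i in subset) == total:
            return set(subset)
    return None


d, base = 2, 3
integers = greedy_distinct_sums(6, d)
length = 1
while base**length <= max(integers):
    length += 1
columns = [to_digits(x, base, length) for x in integers]
```

The lookup in `decode` is exhaustive and only meant to mirror the uniqueness argument; it is not an efficient decoder.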

Remark 8: Another method to improve the rate of the codes in Theorems 2 and 3, and to construct equidistant SQ-separable codes with higher rates, is to use Construction 3 in Theorem 4 with a binary separable code for CGT or a binary separable code for the adder model (QGT).

The constructions described up to this point are able to identify up to d defectives in a pool of n subjects whenever $d \ll n$. It is also of interest to address the question of what happens if $d \not\ll n$, and one has $0 \leq d \leq n$ instead. This "dense" testing regime may be of use whenever no bound on the number of defectives is known a priori, which may be the case with de novo sequencing problems.

In [37], Lindström described a binary construction for the adder model capable of identifying up to n defectives. In the next theorem we describe a generalization of this construction that employs a q-ary alphabet; using this generalization, we construct a SQ-separable code capable of identifying up to n defectives in an equidistant SQGT model. Extensions of [37] to a q-ary alphabet were also addressed in [35]. Multiplying these codes with $\eta$ results in a SQ-separable code with the same rate as our construction, to be presented below. It is worth pointing out that the codes in [35] and [36] may only be constructed in a recursive manner, while our construction provides a direct, simple method of significant advantage for applications in which the number of test subjects is very large.

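Theorem 9 (Construction 9): Let $\kappa \in \mathbb{Z}^+$ and $m = 2^\kappa - 1$. Consider the set $[\kappa]$ and label each of its non-empty subsets by $\mathcal{S}_i$, $i \in [m]$, such that for any two subsets $\mathcal{S}_{i_1}, \mathcal{S}_{i_2} \subseteq [\kappa]$, the inequality $|\mathcal{S}_{i_1}| < |\mathcal{S}_{i_2}|$ implies $i_1 < i_2$. Let $q' = \lfloor \frac{q-1}{\eta} \rfloor + 1$ and $q'' = \log_2 \lfloor \frac{q-1}{\eta} \rfloor$; for each $\mathcal{S}_i$, form a matrix $C_i \in [q']^{m \times (q'' + |\mathcal{S}_i|)}$ as follows. For $j \in [m]$ and $k \in [q'' + 1]$, set

$$C_i(j,k) = \begin{cases} 2^{q'' - k + 1}, & \text{if } |\mathcal{S}_i \cap \mathcal{S}_j| \text{ is odd}, \\ 0, & \text{if } |\mathcal{S}_i \cap \mathcal{S}_j| \text{ is even}. \end{cases} \qquad (24)$$

Let $\mathcal{T}_{i,q''+1} = \mathcal{S}_i$. For $k \in \{q'' + 2, q'' + 3, \ldots, q'' + |\mathcal{S}_i|\}$, fix any $\mathcal{T}_{i,k} \subset \mathcal{T}_{i,k-1}$ of size $|\mathcal{T}_{i,k}| = |\mathcal{S}_i| - k + q'' + 1$. Set

$$C_i(j,k) = \begin{cases} 1, & \text{if } C_i(j,k-1) > 0 \text{ and } |\mathcal{S}_j \cap \mathcal{T}_{i,k}| \text{ is odd}, \\ 0, & \text{otherwise}, \end{cases} \qquad (25)$$

where $j \in [m]$. Form a matrix $C = \eta C'$ where $C' = [C_1, C_2, \ldots, C_m]$. The matrix C is a $[q; Q; \eta; (1:n); 0]$-SQ-separable code of length $m = 2^\kappa - 1$ and size $n = \kappa 2^{\kappa-1} + q''(2^\kappa - 1)$.

Proof: See Appendix A. ■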
In order to gain a better understanding of how this construction works, we provide a simple example. Let $\kappa = 3$, $\eta = 2$, and q = 9; consequently, m = 7, q' = 5, and q'' = 2. We label the non-empty subsets of [3] as follows: $\mathcal{S}_1 = \{1\}$, $\mathcal{S}_2 = \{2\}$, $\mathcal{S}_3 = \{3\}$, $\mathcal{S}_4 = \{1,2\}$, $\mathcal{S}_5 = \{1,3\}$, $\mathcal{S}_6 = \{2,3\}$, $\mathcal{S}_7 = \{1,2,3\}$. Consider the construction of $C_7$ corresponding to $\mathcal{S}_7$. Fix $\mathcal{T}_{7,4} = \{1,2\}$ and $\mathcal{T}_{7,5} = \{1\}$. Using (24) and (25), one has

$$C_7 = \begin{pmatrix} 4 & 2 & 1 & 1 & 1 \\ 4 & 2 & 1 & 1 & 0 \\ 4 & 2 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 4 & 2 & 1 & 0 & 0 \end{pmatrix},$$

where the rows are indexed by $\mathcal{S}_1, \ldots, \mathcal{S}_7$. Note that other choices for $\mathcal{T}_{7,4}$ and $\mathcal{T}_{7,5}$ result in different acceptable matrices.

Using (24) and (25) for all subsets, one obtains the $7 \times 26$ matrix $C' = [C_1, C_2, \ldots, C_7]$, whose rows are indexed by $\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_7$.
In order to prove that C = 2C' is a SQ-separable code, we only need to show that C' is a separable code for an adder model.

Let $w \in [2]^n$ be a binary vector such that its l-th coordinate is equal to 1 if the l-th subject is defective and 0 otherwise. In the adder model, the vector of test results is equal to y = C'w. This is equivalent to a system of linear equations with n variables (coordinates of w) and m equations. Note that each set $\mathcal{S}_i$ corresponds to $q'' + |\mathcal{S}_i|$ such variables. In order to solve this system of equations, we first solve for variables corresponding to $\mathcal{S}_m$. After finding their values, we solve for variables corresponding to $\mathcal{S}_{m-1}$, and so on, until we find all the variables.

Returning to our example, we can solve for the variables corresponding to $\mathcal{S}_7$ as follows. Add all the equations (i.e. rows) corresponding to odd subsets of $\mathcal{S}_7$. The result is an equation of the form

$$s_{odd,7}^T\, w = y(1) + y(2) + y(3) + y(7),$$

where

$$s_{odd,7} = (8\ 4\ 2\ \ 8\ 4\ 2\ \ 8\ 4\ 2\ \ 8\ 4\ 2\ 1\ \ 8\ 4\ 2\ 1\ \ 8\ 4\ 2\ 1\ \ 16\ 8\ 4\ 2\ 1)^T.$$

Also, add all the equations (i.e. rows) corresponding to even subsets of $\mathcal{S}_7$. The result is an equation of the form

$$s_{even,7}^T\, w = y(4) + y(5) + y(6),$$

where

$$s_{even,7} = (8\ 4\ 2\ \ 8\ 4\ 2\ \ 8\ 4\ 2\ \ 8\ 4\ 2\ 1\ \ 8\ 4\ 2\ 1\ \ 8\ 4\ 2\ 1\ \ 0\ 0\ 0\ 0\ 0)^T.$$

Since the first 21 entries of $s_{odd,7}$ and $s_{even,7}$ are identical, one has

$$(s_{odd,7} - s_{even,7})^T w = 16w(22) + 8w(23) + 4w(24) + 2w(25) + w(26)$$
$$= y(1) + y(2) + y(3) + y(7) - y(4) - y(5) - y(6). \qquad (28)$$

Eq. (28) can be considered as the binary representation of the integer y(1) + y(2) + y(3) + y(7) - y(4) - y(5) - y(6), and therefore the variables w(22), w(23), w(24), w(25), and w(26) are uniquely determined by the above equation. Now, given these variables, one can add all the equations corresponding to odd and even subsets of $\mathcal{S}_6$ to similarly identify w(18), w(19), w(20), and w(21). This process can be continued until all the variables are uniquely determined.
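The worked example can be reproduced with a short script. The sketch below assumes that rule (24) assigns $2^{q''-k+1}$ when $|\mathcal{S}_i \cap \mathcal{S}_j|$ is odd and 0 when it is even, and that rule (25) sets an entry to 1 when the entry to its left is positive and $|\mathcal{S}_j \cap \mathcal{T}_{i,k}|$ is odd; this is our reading of the construction. The nested sets are taken as prefixes of the sorted subset, which agrees with the choice {1,2}, {1} for $\mathcal{S}_7$ and is an arbitrary assumption for the other subsets.

```python
import numpy as np

q2 = 2                                                   # plays the role of q''
subsets = [{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]   # S_1, ..., S_7
m = len(subsets)


def block(i):
    """Matrix C_i for the subset S_i (0-based index i)."""
    S_i = subsets[i]
    elems = sorted(S_i)
    cols = q2 + len(S_i)
    Ci = np.zeros((m, cols), dtype=int)
    for j, S_j in enumerate(subsets):
        odd = len(S_i & S_j) % 2 == 1
        for k in range(1, q2 + 2):                       # rule (24), k = 1..q''+1
            Ci[j, k - 1] = 2 ** (q2 - k + 1) if odd else 0
        for k in range(q2 + 2, cols + 1):                # rule (25)
            T = set(elems[: len(S_i) - k + q2 + 1])      # nested prefix sets
            if Ci[j, k - 2] > 0 and len(S_j & T) % 2 == 1:
                Ci[j, k - 1] = 1
    return Ci


C_prime = np.hstack([block(i) for i in range(m)])        # 7 x 26 matrix
s_odd7 = C_prime[[0, 1, 2, 6]].sum(axis=0)               # rows S_1, S_2, S_3, S_7
s_even7 = C_prime[[3, 4, 5]].sum(axis=0)                 # rows S_4, S_5, S_6


def decode_last_five(y):
    """Recover w(22), ..., w(26) from the binary expansion of the signed row sum."""
    v = int(y[0] + y[1] + y[2] + y[6] - y[3] - y[4] - y[5])
    return [(v >> s) & 1 for s in (4, 3, 2, 1, 0)]
```

With these choices the vectors `s_odd7` and `s_even7` agree with the two vectors listed in the example, and `decode_last_five(C_prime @ w)` returns the last five coordinates of any binary vector w of length 26.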

Remark 9: Construction 9 provides a code capable of identifying any number of defectives among $n = \kappa 2^{\kappa-1} + q''(2^\kappa - 1)$ subjects, using $m = 2^\kappa - 1$ experiments. It can be easily shown that the same construction can be used for any number of subjects. For a fixed value of q'', one can find the smallest number $\kappa$ such that $n \leq \kappa 2^{\kappa-1} + q''(2^\kappa - 1)$. Dropping the rightmost $(\kappa 2^{\kappa-1} + q''(2^\kappa - 1) - n)$ columns of C in Construction 9 results in a SQ-separable code of size n and length $m = 2^\kappa - 1$.

V. Belief Propagation Decoders for SQGT 

In the previous sections, we introduced different codes for SQGT. The SQ-disjunct codes, and codes constructed based on them via Construction 3, have a decoding procedure with complexity O(mn) that was discussed in the previous section. In addition, the SQ-separable codes discussed in Construction 9 have an iterative decoder that is discussed in the proof of Theorem 9 in Appendix A. On the other hand, SQ-separable codes in general lack efficient decoders.

Different decoders have been proposed for CGT in the literature (e.g. [47]-[50]). Although these algorithms provide efficient decoding in many instances of CGT, due to the more complicated nature of SQGT, namely the non-binary test matrix and the system-dependent non-binary test results, the application of these methods in SQGT seems to be implausible. On the other hand, since most proposed SQGT codes are sparse, a decoder based on belief propagation (BP) [51] appears to be a desirable option. In particular, we are interested in designing a decoder that can be applied to SQGT codes with probabilistic constructions (such as Constructions 2 and 5). The theoretical guarantees for these codes are asymptotic, and when the number of subjects is only on the order of a few hundred, these guarantees may not apply. Nevertheless, in what follows, we show that BP decoders perform reasonably well even for a small number of subjects and large coding rates.

BP is an iterative message passing algorithm for performing inference on graphical models by calculating the marginal distributions of the variable nodes; it has found many applications in machine learning, coding theory, and other areas. BP decoding was originally proposed for binary disjunct codes by one of the authors in [52]. Later on, BP decoding was also considered in [53] for CGT decoding. Motivated by these results, we propose a BP decoder for SQGT, which performs an approximate bitwise maximum a posteriori (MAP) decoding of SQGT codes in the presence of errors.

Let $w \in [2]^n$ be the binary incidence vector of defectives, which we want to reconstruct. We assume that each subject is defective with probability $d/n$, independently of other subjects (if $d$ is not known, we can set this probability equal to 1/2). Also, let $C \in [q]^{m \times n}$ and $y \in [Q]^m$ be the test matrix and the vector of test results (in the absence of errors), respectively. We assume that the result of each test may be subject to an error (due to false positives and false negatives). Let $z \in [Q]^m$ be the noisy vector of test results.

We model the effect of false positives and false negatives using two probabilities, $\gamma_p$ and $\gamma_n$, respectively. In other words, for the $t$-th test, if $y(t) \in \{1, 2, \dots, Q-2\}$ then $z(t) = y(t)$ with probability $1 - \gamma_p - \gamma_n$, $z(t) = y(t) + 1$ with probability $\gamma_p$, and $z(t) = y(t) - 1$ with probability $\gamma_n$. If $y(t) = 0$ then $z(t) = y(t)$ with probability $1 - \gamma_p$ and $z(t) = y(t) + 1$ with probability $\gamma_p$. Finally, if $y(t) = Q - 1$, then $z(t) = y(t)$ with probability $1 - \gamma_n$ and $z(t) = y(t) - 1$ with probability $\gamma_n$.
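The error model above can be simulated with a few lines of Python. This is a minimal sketch; the function name `noisy_result` and the optional `rng` argument are illustrative choices, not part of the paper.

```python
import random

def noisy_result(y, Q, gamma_p, gamma_n, rng=random):
    """Draw z(t) given the noise-free outcome y in {0, ..., Q-1} (Q >= 2)."""
    u = rng.random()
    if y == 0:                       # only an upward error is possible
        return y + 1 if u < gamma_p else y
    if y == Q - 1:                   # only a downward error is possible
        return y - 1 if u < gamma_n else y
    if u < gamma_p:                  # false positive: outcome shifted up
        return y + 1
    if u < gamma_p + gamma_n:        # false negative: outcome shifted down
        return y - 1
    return y
```

Applying `noisy_result` independently to every entry of $y$ yields one realization of the noisy vector $z$.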

For a function $f(w)$, let the sum of $f(w)$ over all configurations of the variables other than $w(i)$ be denoted by $\sum_{\sim w(i)} f(w)$. For the $i$-th subject, the BP decoder estimates

$$\begin{aligned}
\hat{w}_{\mathrm{MAP}}(i) &= \arg\max_{w(i) \in \{0,1\}} P(w(i) \mid z, C) \\
&= \arg\max_{w(i) \in \{0,1\}} \sum_{\sim w(i)} \prod_{t=1}^{m} P(z(t) \mid w, C) \prod_{j=1}^{n} P(w(j)) \\
&= \arg\max_{w(i) \in \{0,1\}} \sum_{\sim w(i)} \prod_{t=1}^{m} P(z(t) \mid w, C) \prod_{j=1}^{n} \left( \frac{d}{n}\, I(w(j) = 1) + \left(1 - \frac{d}{n}\right) I(w(j) = 0) \right),
\end{aligned} \tag{29}$$



where $P(\cdot \mid \cdot)$ denotes a conditional probability distribution, and $I(\cdot)$ is the indicator function.

Using (29), we can form a factor graph with $n$ variable nodes and $m$ factor nodes; a factor node corresponding to test $t$ is only connected to the variable nodes that correspond to subjects present in the $t$-th test. Similarly, a variable node corresponding to the $i$-th subject is only connected to the factor nodes corresponding to the tests in which the $i$-th subject is used.

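Let $\mathcal{N}(t)$ denote the neighbors of the node corresponding to test $t$ in the factor graph. Also, let $\mathcal{N}(i)$ denote the neighbors of the $i$-th subject. The BP message update rules for finding the marginal distributions of each subject are given by:

$$\mu^{(l)}_{i \to t}(w(i)) \propto \left( \frac{d}{n}\, I(w(i) = 1) + \left(1 - \frac{d}{n}\right) I(w(i) = 0) \right) \prod_{\tau \in \mathcal{N}(i) \setminus \{t\}} \mu^{(l)}_{\tau \to i}(w(i)), \tag{30}$$

and

$$\mu^{(l)}_{t \to i}(w(i)) \propto \sum_{\sim w(i)} P(z(t) \mid w, C) \prod_{j \in \mathcal{N}(t) \setminus \{i\}} \mu^{(l-1)}_{j \to t}(w(j)). \tag{31}$$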



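Let $\omega_i := \sum_{l=1,\, l \ne i}^{n} w(l)\, C(t, l)$. The term $P(z(t) \mid w, C)$ in (31) has to be calculated for different values of $z(t)$. One has

$$P(z(t) = 0 \mid w, C) = \begin{cases}
\gamma_n\, I(\eta_1 \le \omega_i < \eta_2) + (1 - \gamma_p)\, I(\omega_i < \eta_1), & \text{if } w(i) = 0, \\
\gamma_n\, I(\eta_1 - C(t,i) \le \omega_i < \eta_2 - C(t,i)) + (1 - \gamma_p)\, I(\omega_i < \eta_1 - C(t,i)), & \text{if } w(i) = 1,
\end{cases}$$

$$P(z(t) = Q-1 \mid w, C) = \begin{cases}
(1 - \gamma_n)\, I(\eta_{Q-1} \le \omega_i < \eta_Q) + \gamma_p\, I(\eta_{Q-2} \le \omega_i < \eta_{Q-1}), & \text{if } w(i) = 0, \\
(1 - \gamma_n)\, I(\eta_{Q-1} - C(t,i) \le \omega_i < \eta_Q - C(t,i)) + \gamma_p\, I(\eta_{Q-2} - C(t,i) \le \omega_i < \eta_{Q-1} - C(t,i)), & \text{if } w(i) = 1,
\end{cases}$$

and for $z(t) = r$ with $r \in \{1, 2, \dots, Q-2\}$, one has

$$P(z(t) = r \mid w, C) = \begin{cases}
(1 - \gamma_n - \gamma_p)\, I(\eta_r \le \omega_i < \eta_{r+1}) + \gamma_p\, I(\eta_{r-1} \le \omega_i < \eta_r) + \gamma_n\, I(\eta_{r+1} \le \omega_i < \eta_{r+2}), & \text{if } w(i) = 0, \\
(1 - \gamma_n - \gamma_p)\, I(\eta_r - C(t,i) \le \omega_i < \eta_{r+1} - C(t,i)) + \gamma_p\, I(\eta_{r-1} - C(t,i) \le \omega_i < \eta_r - C(t,i)) \\
\quad + \gamma_n\, I(\eta_{r+1} - C(t,i) \le \omega_i < \eta_{r+2} - C(t,i)), & \text{if } w(i) = 1.
\end{cases}$$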
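The message updates and the likelihood terms above can be combined into a compact decoder. The following Python sketch is one possible realization under stated assumptions, not the authors' implementation: thresholds are passed as `eta = [eta_1, ..., eta_{Q-1}]` with $\eta_0 = 0$ and $\eta_Q = \infty$ assumed, the sum over $\sim w(i)$ is evaluated by convolving the incoming messages into a PMF of $\omega_i$ (valid because the likelihood depends on $w$ only through the pooled amount), and all function and variable names are illustrative.

```python
import bisect
import numpy as np

def likelihood(z, s, eta, gamma_p, gamma_n):
    """P(z(t) = z | pooled amount s) for the +-1 error model."""
    Q = len(eta) + 1
    y = bisect.bisect_right(eta, s)      # noise-free outcome: eta_y <= s < eta_{y+1}
    if z == y:
        stay = 1.0
        if y < Q - 1:
            stay -= gamma_p              # an upward error was possible
        if y > 0:
            stay -= gamma_n              # a downward error was possible
        return stay
    if z == y + 1:
        return gamma_p
    if z == y - 1:
        return gamma_n
    return 0.0

def bp_decode(C, z, eta, d, gamma_p, gamma_n, iters=20):
    """Approximate posteriors P(w(i) = 1 | z, C) after `iters` BP sweeps."""
    C = np.asarray(C, dtype=int)
    m, n = C.shape
    prior1 = d / n
    v2f = np.full((m, n), prior1)        # subject -> test messages: P(w(i) = 1)
    f2v = np.ones((m, n, 2))             # test -> subject messages (unnormalized)
    tests_of = [np.flatnonzero(C[:, i]) for i in range(n)]
    subjects_of = [np.flatnonzero(C[t]) for t in range(m)]

    def belief(i, skip=None):
        # Prior times incoming test messages, optionally excluding test `skip`.
        p0, p1 = 1.0 - prior1, prior1
        for t in tests_of[i]:
            if skip is None or t != skip:
                p0 *= f2v[t, i, 0]
                p1 *= f2v[t, i, 1]
        return p1 / (p0 + p1) if p0 + p1 > 0 else prior1

    for _ in range(iters):
        for t in range(m):               # test -> subject update, cf. (31)
            for i in subjects_of[t]:
                pmf = np.array([1.0])    # PMF of omega_i under incoming messages
                for j in subjects_of[t]:
                    if j != i:
                        step = np.zeros(C[t, j] + 1)
                        step[0], step[-1] = 1.0 - v2f[t, j], v2f[t, j]
                        pmf = np.convolve(pmf, step)
                for b in (0, 1):
                    f2v[t, i, b] = sum(
                        p * likelihood(z[t], s + b * C[t, i], eta, gamma_p, gamma_n)
                        for s, p in enumerate(pmf))
        for i in range(n):               # subject -> test update, cf. (30)
            for t in tests_of[i]:
                v2f[t, i] = belief(i, skip=t)
    return np.array([belief(i) for i in range(n)])
```

For instance, `bp_decode([[1, 1]], [1], [1], d=1, gamma_p=0.0, gamma_n=0.0)` treats a single noise-free binary test on two subjects; since the factor graph is a tree, the returned values coincide with the exact posteriors. For long codes the unnormalized products may underflow, so a practical implementation would normalize the messages or work in the log domain.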



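After the $L$-th iteration, the marginal distribution of the $i$-th subject can be approximated as

$$P^{(L)}(w(i) \mid z, C) \propto \left( \frac{d}{n}\, I(w(i) = 1) + \left(1 - \frac{d}{n}\right) I(w(i) = 0) \right) \prod_{t \in \mathcal{N}(i)} \mu^{(L)}_{t \to i}(w(i)).$$

Upon computing the marginals, the set of defectives may be determined using two methods. In the first method,

$$\hat{\mathcal{D}} = \left\{ i : P^{(L)}(w(i) = 1 \mid z, C) > P^{(L)}(w(i) = 0 \mid z, C) \right\}, \tag{32}$$

while in the second method

$$\hat{\mathcal{D}} = \left\{ i : i \text{ has one of the } d \text{ largest } P^{(L)}(w(i) = 1 \mid z, C) \right\}. \tag{33}$$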
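The two decision rules (32) and (33) can be sketched in Python as follows, assuming the marginals $P^{(L)}(w(i) = 1 \mid z, C)$ are available as a list `p1`; the function names are illustrative.

```python
def decide_by_comparison(p1):
    """Rule (32): declare i defective if P(w(i)=1) exceeds P(w(i)=0)."""
    return {i for i, p in enumerate(p1) if p > 1.0 - p}

def decide_top_d(p1, d):
    """Rule (33): declare the d subjects with the largest P(w(i)=1) defective."""
    order = sorted(range(len(p1)), key=lambda i: p1[i], reverse=True)
    return set(order[:d])

marginals = [0.9, 0.2, 0.6, 0.4]        # hypothetical BP output
by_comparison = decide_by_comparison(marginals)
top_three = decide_top_d(marginals, 3)
```

The first rule does not require knowledge of $d$ and may return any number of subjects, whereas the second always returns exactly $d$ subjects.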



For demonstrative purposes, we applied the BP algorithm to an equidistant SQGT model with $\eta = 2$. We used Construction 2 to generate codes with $n = 100$ and $d = 15$. Fig. 6 illustrates the probability of error, $P_e$, as a function of $q$ for different values of $\gamma_p$ and $\gamma_n$, when $m = 50$. We performed 400 trials for each value of $q$, and the number of iterations in the BP algorithm was fixed to $L = 20$. The set of defectives was obtained using (33). Fig. 7 shows the performance of the BP algorithm in a similar setting when (32) was used to obtain the set of defectives.^9 In this figure, the probability of false negatives, $P_{fn}$, is defined as the probability that a defective is not detected, and the probability of false positives, $P_{fp}$, is defined as the probability that a non-defective subject is detected as defective. Note that when we use (33), $P_e = P_{fn} = P_{fp}$.

9 In this figure, we only showed the noisy case in order to keep the figure clear and uncluttered.

Fig. 6: Probability of error as a function of test matrix alphabet size $q$ for different noise distributions. In this model we fixed $\eta = 2$, $n = 100$, $d = 15$, and $m = 50$.

In order to demonstrate the effect of $m$ on the performance of the algorithm, we applied the BP algorithm to a similar equidistant SQGT model with $\eta = 2$. Using Construction 2, we generated codes with $n = 100$, $d = 15$, and $q = 11$. Fig. 8 shows the probability of error as a function of $m$ for noisy and noise-free scenarios when (33) was used to obtain the set of defectives. For each $m$, the BP algorithm was applied to 400 codes and it ran for $L = 20$ iterations. Similarly, Fig. 9 demonstrates the probabilities of false negatives and false positives when (32) was used to find the set of defectives.

As may be seen from the simulation results, there is a clear advantage of using codes with $q > 3$ from the perspective of BP decoding in the presence of errors. Unfortunately, this effect is accompanied by an increase in the complexity of non-binary BP decoding. One may also notice that the decoding error of the BP decoders is high; this effect may be attributed to the fact that the random codes were generated for parameters that are not in the range of values that would assure high probability for the SQ-separability property. It remains an open problem to find a construction for SQ codes better suited to BP decoding.

Fig. 7: Probability of false negatives and false positives as a function of $q$ for different noise distributions. The solid lines represent the probability of false negatives and the dashed lines represent the probability of false positives. In this model we fixed $\eta = 2$, $n = 100$, $d = 15$, and $m = 50$.

Fig. 8: Probability of error as a function of $m$ for different noisy and noise-free scenarios. In this model we fixed $\eta = 2$, $n = 100$, $d = 15$, and $q = 11$.

Fig. 9: Probability of false negatives and false positives as a function of $m$ for different noise distributions. The solid lines represent the probability of false negatives and the dashed lines represent the probability of false positives. In this model we fixed $\eta = 2$, $n = 100$, $d = 15$, and $q = 11$.

VI. Capacity of SQGT 

In Section IV, we described explicit and probabilistic constructions for SQGT test matrices capable of identifying defectives with zero probability of error. A natural question to ask is what happens if we relax this constraint and instead construct test matrices that allow identifying the defectives with an average probability of error that converges to zero. The answer to this question is closely related to Shannon's random coding theory. In particular, it is well known that different models of group testing may be viewed as special instances of a multiple access channel (MAC). Using this connection, asymptotic information theoretic bounds were obtained for the number of tests required to approach zero error (e.g., see [54], [55], [23], [56], [53]). Using these ideas, one can define the capacity of a group testing "channel" similar to the capacity of a communication channel.

Our goal in this section is not to derive new bounds on the number of tests for generalized MAC models, as substantial work has already been done regarding various extensions of MAC models. Rather, we use the existing results and adapt them to the framework of SQGT while introducing novel ideas about optimal threshold selection for the decimator. In other words, we introduce into the GT framework an interesting new component from the area of source coding, namely the design of the best quantization scheme for adder channels. Although one may argue that in genotyping applications the thresholds are usually fixed by specific system design, it is still an interesting question to find what the optimal thresholds would be if one were in the situation to impose such thresholds.

Although almost all information-theoretic approaches rely on using probabilistic constructions of binary test matrices for CGT,^10 the generalization of these methods to non-binary test matrices in a SQGT model is straightforward. The main difference arises due to the different forms of the mutual information used to express the necessary and sufficient conditions and, consequently, the capacity of the group testing scheme. We therefore focus on characterizing the mutual information terms arising in the SQGT framework.

Let the sample amount of each subject in each test be chosen in an i.i.d. manner from a $q$-ary alphabet, according to a distribution $P_T$. Also, let $\mathcal{D}_1^{(i)}$ and $\mathcal{D}_2^{(i)}$ be disjoint partitions of the set of defectives, $\mathcal{D}$, such that $|\mathcal{D}_1^{(i)}| = i$ and $|\mathcal{D}_2^{(i)}| = d - i$; we denote by $\mathcal{A}^{(i)}$ the set of all possible pairs $(\mathcal{D}_1^{(i)}, \mathcal{D}_2^{(i)})$. For a single test with result $y$, we define $t_{\mathcal{D}_j^{(i)}}$ (where $j = 1, 2$) to be a vector of size $1 \times |\mathcal{D}_j^{(i)}|$, with its $k$-th entry equal to the sample amount of the $k$-th defective of $\mathcal{D}_j^{(i)}$ in the test. Fig. 10 shows a choice of $(\mathcal{D}_1^{(i)}, \mathcal{D}_2^{(i)})$ and their corresponding vectors $t_{\mathcal{D}_1^{(i)}}$ and $t_{\mathcal{D}_2^{(i)}}$ for the case where $d = 5$ and $q = 2$. Also, let $I(t_{\mathcal{D}_1^{(i)}}; t_{\mathcal{D}_2^{(i)}}, y)$ be the mutual information between $t_{\mathcal{D}_1^{(i)}}$ and $(t_{\mathcal{D}_2^{(i)}}, y)$. Note that since the amount of each subject in each test is chosen independently and with the same probability distribution, the value of $I(t_{\mathcal{D}_1^{(i)}}; t_{\mathcal{D}_2^{(i)}}, y)$ does not depend on the specific choice of $(\mathcal{D}_1^{(i)}, \mathcal{D}_2^{(i)})$ and only depends on $i$, $P_T$, and $d$. Similar to the previous sections, let $R = \frac{\log n}{m}$ denote the rate of a SQGT test matrix. Using this notation, the asymptotic capacity of a channel corresponding to the SQGT scheme (henceforth, SQGT channel) can be defined as follows.

Definition 6 (Asymptotic capacity of SQGT channel): The asymptotic capacity of a SQGT channel is defined as

$$C = \sup_{P_T,\, \eta} a(d, P_T, \eta), \tag{34}$$

where $a(d, P_T, \eta) = \min_{i = 1, 2, \dots, d} \dfrac{I(t_{\mathcal{D}_1^{(i)}}; t_{\mathcal{D}_2^{(i)}}, y)}{i}$.

10 The results in [23, Section 6] are derived for binary test matrices in a general symmetric MAC. "Probabilistic construction" in these derivations refers to the fact that the entries of the test matrix are chosen in an i.i.d. manner, with probability of a subject being included in a test equal to $p$.

Fig. 10: One choice of $(\mathcal{D}_1^{(i)}, \mathcal{D}_2^{(i)})$ and their corresponding $t_{\mathcal{D}_1^{(i)}}$ and $t_{\mathcal{D}_2^{(i)}}$ in a binary test design for $d = 5$.

If the thresholds $\eta$ are determined a priori by the resolution of the test equipment, the only design parameter to optimize over is $P_T$. On the other hand, if one is able to control the thresholds, $\eta$ becomes a design parameter that clearly exhibits a strong influence on the capacity of the testing scheme.

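Definition 6 is a direct consequence of some simple modifications of the bounds for the zero-converging average probability of error in [56], namely the sufficient condition of the form

$$m > \max_{i:\, (\mathcal{D}_1^{(i)}, \mathcal{D}_2^{(i)}) \in \mathcal{A}^{(i)}} \frac{\log \left( \binom{n-d}{i} \binom{d}{i} \right)}{I(t_{\mathcal{D}_1^{(i)}}; t_{\mathcal{D}_2^{(i)}}, y)}, \quad i = 1, 2, \dots, d, \tag{35}$$

and the necessary condition of the form

$$m \ge \max_{i:\, (\mathcal{D}_1^{(i)}, \mathcal{D}_2^{(i)}) \in \mathcal{A}^{(i)}} \frac{\log \binom{n-d+i}{i}}{I(t_{\mathcal{D}_1^{(i)}}; t_{\mathcal{D}_2^{(i)}}, y)}, \quad i = 1, 2, \dots, d. \tag{36}$$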

Further simplifications are possible by noting that for a fixed distribution $P_T$ and for fixed $\eta$,

$$\frac{I(t_{\mathcal{D}_1^{(d)}}; t_{\mathcal{D}_2^{(d)}}, y)}{d} \le \frac{I(t_{\mathcal{D}_1^{(d-1)}}; t_{\mathcal{D}_2^{(d-1)}}, y)}{d-1} \le \dots \le I(t_{\mathcal{D}_1^{(1)}}; t_{\mathcal{D}_2^{(1)}}, y), \tag{37}$$

which is proved in [55] and [23] for binary test matrices. The generalization of this result to the non-binary case is straightforward, and the proof is therefore omitted. The next theorem further clarifies the use of the term "capacity" in Definition 6.

Theorem 10: For the SQGT channel, the capacity equals $C = \sup_{P_T,\, \eta} I(t_{\mathcal{D}_1^{(d)}}; t_{\mathcal{D}_2^{(d)}}, y)/d$, and all rates below capacity are achievable. In other words, for every rate $R < C$, there exists a test design for which the average probability of error converges to zero. Conversely, any test design with average probability of error approaching zero must asymptotically satisfy $R \le C$.

Proof: One way to prove this theorem is by following and properly adapting the steps taken in [55] and [23]. Equivalently, one can use (35)-(37); for completeness, we used the latter approach and provided the full proof in Appendix B. ■
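The mutual information $I(t_{\mathcal{D}_1^{(d)}}; t_{\mathcal{D}_2^{(d)}}, y)$ in this theorem may be evaluated as follows. Let $W_1$ denote the $\ell_1$-norm of $t_{\mathcal{D}_1^{(d)}}$. Then in the absence of noise,

$$I_{SQ}(t_{\mathcal{D}_1^{(d)}}; t_{\mathcal{D}_2^{(d)}}, y) = H(y \mid t_{\mathcal{D}_2^{(d)}}) - H(y \mid t_{\mathcal{D}_1^{(d)}}, t_{\mathcal{D}_2^{(d)}}) = H(y).$$

On the other hand, $\forall\, l \in [Q]$,

$$P(y = l) = P(\eta_l \le W_1 < \eta_{l+1}) = \sum_{w_1 = \eta_l}^{\eta_{l+1} - 1} P_{W_1}(w_1),$$

where $P_{W_1}(w_1)$ is the probability mass function (PMF) of $W_1$ and can be found using

$$P_{W_1}(w_1) = P_T(t_1) * P_T(t_2) * \dots * P_T(t_d),$$

where "$*$" denotes convolution. Note that when $q = 2$,

$$P(y = l) = \sum_{j = \eta_l}^{\eta_{l+1} - 1} \binom{d}{j} p^j (1 - p)^{d - j},$$

and $p$ denotes the probability that a subject is present in a test.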
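The noise-free quantity $H(y)/d$ can be evaluated numerically for a given $P_T$ and set of thresholds. The sketch below is a minimal illustration under stated assumptions: `p_T[a]` is the probability of sample amount `a`, `eta = [eta_1, ..., eta_{Q-1}]` with $\eta_0 = 0$ and the last region extending to the largest possible sum, and the entropy is measured in bits (the base of the logarithm is an assumption). The function names are illustrative.

```python
import numpy as np

def output_pmf(p_T, d, eta):
    """PMF of the noise-free test result y when d defectives are pooled."""
    pmf_w = np.array([1.0])
    for _ in range(d):                       # d-fold convolution of P_T gives the PMF of W_1
        pmf_w = np.convolve(pmf_w, p_T)
    edges = [0] + list(eta) + [len(pmf_w)]   # region l covers eta_l <= w_1 < eta_{l+1}
    return np.array([pmf_w[edges[l]:edges[l + 1]].sum()
                     for l in range(len(edges) - 1)])

def rate_bound(p_T, d, eta):
    """H(y)/d in bits for the given sampling distribution and thresholds."""
    p = output_pmf(p_T, d, eta)
    p = p[p > 0]                             # 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum() / d)

# Binary design with p = 1/2, d = 2 and thresholds {0}{1}{2}: H(y) = 1.5 bits.
example_rate = rate_bound([0.5, 0.5], 2, [1, 2])
```

Maximizing `rate_bound` over a grid of distributions and threshold placements gives a simple search of the kind used to obtain numerical lower bounds on the capacity.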

Due to the complicated expression for the mutual information for an arbitrary distribution, a closed-form expression for the test capacity cannot be obtained. We therefore evaluated (34) numerically using a simple search procedure that allows us to quickly determine a lower bound on the capacity. Fig. 11 shows the obtained lower bound on the capacity when $q = 3$, and $Q = 2$ or $Q = 3$. Table II shows one set of probability distributions and thresholds achieving this bound for $Q = 3$.

Table II reveals an interesting property of the quantizers found through numerical search: there exists at least one quantization region that consists of one or two elements only. What this finding implies is that in order to reduce the number of tests as much as possible, a sufficient amount of qualitative information has to be preserved. For example, a quantizer that assigns the value $v$ only to inputs of value $v$ allows for resolving a large amount of uncertainty about the identity of the test subjects. Furthermore, the most informative input, left unaltered after quantization, corresponds to a statistical average of the input symbols, reminiscent of the centroid of a quantization region.

Fig. 11: Numerically obtained lower bound for SQGT with $q = 3$ for different values of $d$.

TABLE II: A set of probability distributions and thresholds corresponding to $Q = 3$ in Fig. 11

  d     P_T                  quantizer
  2     [0.33 0.34 0.33]     {0,1}{2}{3,4}
  3     [0.43 0.46 0.11]     {0,1}{2}{3,4,5,6}
  4     [0.18 0.64 0.18]     {0,1,2,3}{4}{5,6,7,8}
  5     [0.15 0.70 0.15]     {0,1,2,3,4}{5}{6,7,8,9,10}
  6     [0.46 0.15 0.39]     {0,1,2,3,4}{5,6}{7,8,...,12}
  7     [0.34 0.25 0.41]     {0,1,...,6}{7,8}{9,10,...,14}
  8     [0.10 0.80 0.10]     {0,1,...,7}{8}{9,10,...,16}
  9     [0.09 0.82 0.09]     {0,1,...,8}{9}{10,11,...,18}
  10    [0.58 0.28 0.14]     {0,1,...,4}{5,6}{7,8,...,20}

VII. Conclusions 

We introduced the notion of semi-quantitative group testing, suitable for pooling schemes associated with high-throughput genotyping applications. We showed that the SQGT model can be considered as a unifying framework for group testing in the sense that most known group testing models are special cases of SQGT. For the novel (possibly) non-binary group testing framework, we generalized the notion of disjunct and separable codes and provided a number of combinatorial and probabilistic constructions for such codes. Furthermore, we developed a BP-decoding framework for semi-quantitative testing that may be used for testing schemes with measurement errors. Finally, we extended the notion of the capacity of group testing so that it applies to semi-quantitative testing, and we numerically evaluated this test invariant for a number of practical code parameters.

Appendix A 
Proof of Theorem 9

In order to prove this theorem, we use the following lemma from 0711 . 

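Lemma 3: Let $\mathcal{F}$ be a collection of sets such that if $B \in \mathcal{F}$, then $\mathcal{F}$ contains all the subsets of $B$ as well. In other words, $\forall B \in \mathcal{F}$, if $A \subseteq B$, then $A \in \mathcal{F}$. Let $g : \mathcal{F} \to \{0, 1\}$ be a function defined on $\mathcal{F}$ such that for some fixed set $S \in \mathcal{F}$, one has $g(A \cap S) = g(A)$ when $A \in \mathcal{F}$. If $C \in \mathcal{F}$ and $C \not\subseteq S$, then

$$\sum_{\substack{A \subseteq C \\ |A| \text{ is odd}}} g(A) = \sum_{\substack{A \subseteq C \\ |A| \text{ is even}}} g(A).$$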

Proof: See 07J. ■ 
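The statement of Lemma 3 can be checked by brute force on a small instance. In the Python sketch below, the sets $S = \{1, 2\}$ and $C = \{1, 2, 3\}$ and the function `g` are hypothetical choices made only for illustration; `g` depends on its argument only through its intersection with $S$, so that $g(A \cap S) = g(A)$.

```python
from itertools import combinations

def odd_even_sums(C, g):
    """Sum g over the odd-size and the even-size subsets of C."""
    odd = even = 0
    elems = sorted(C)
    for r in range(len(elems) + 1):
        for A in combinations(elems, r):
            if r % 2:
                odd += g(frozenset(A))
            else:
                even += g(frozenset(A))
    return odd, even

S = frozenset({1, 2})

def g(A):
    # Binary function that satisfies g(A & S) == g(A) by construction.
    return 1 if len(A & S) % 2 == 1 else 0

balanced = odd_even_sums({1, 2, 3}, g)     # C is not contained in S
unbalanced = odd_even_sums({1, 2}, g)      # C is contained in S
```

When $C$ contains an element outside $S$, toggling that element pairs every odd subset with an even subset of equal $g$-value, which is why the two sums agree; when $C \subseteq S$ no such pairing is guaranteed and the sums may differ.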
Proof of Theorem 9: As before, we define $w \in [2]^n$ as a binary vector such that its $l$-th coordinate is equal to 1 if the $l$-th subject is defective, and 0 otherwise. From the construction, the matrix $C$ is formed from $m$ sub-matrices $C_i$, each corresponding to a subset of $[K]$, $\mathcal{S}_i$. This implies that each $\mathcal{S}_i$ corresponds to a set of variables (i.e., coordinates of $w$). In addition, we label rows of $C$ using subsets $\mathcal{S}_i$, $i \in [m]$, such that the $i$-th row is labeled by $\mathcal{S}_i$. Since each row of $C$ corresponds to an equation in $y = Cw$, each $\mathcal{S}_i$ also corresponds to exactly one equation.

The decoding includes $m$ steps, and in each step one solves for the variables corresponding to $\mathcal{S}_i$, given all the variables corresponding to $\mathcal{S}_{i+1}, \mathcal{S}_{i+2}, \dots, \mathcal{S}_m$. To find the variables corresponding to $\mathcal{S}_i$, we form two equations: the first one is obtained by adding all the equations corresponding to the odd subsets of $\mathcal{S}_i$ and the second one is obtained by adding all the equations corresponding to the even subsets of $\mathcal{S}_i$. These two equations can be represented by $s_{\mathrm{odd}_i}^T w = y_{\mathrm{odd}_i}$ and $s_{\mathrm{even}_i}^T w = y_{\mathrm{even}_i}$, respectively. Finally, we form the equation

$$(s_{\mathrm{odd}_i} - s_{\mathrm{even}_i})^T w = y_{\mathrm{odd}_i} - y_{\mathrm{even}_i}. \tag{38}$$

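For simplicity let $w_{i_k}$ be the $k$-th variable corresponding to $\mathcal{S}_i$, where $k \in \{1, 2, \dots, q'' + |\mathcal{S}_i|\}$. The key in the proof of the theorem is to show that (38) is of the form

$$2^{q'' + |\mathcal{S}_i| - 1} w_{i_1} + 2^{q'' + |\mathcal{S}_i| - 2} w_{i_2} + \dots + w_{i_{q'' + |\mathcal{S}_i|}} = a,$$

where $a$ is a scalar that depends on $y$ and the known variables corresponding to $\mathcal{S}_{i+1}, \mathcal{S}_{i+2}, \dots, \mathcal{S}_m$. This implies that all the coefficients of the variables corresponding to $\mathcal{S}_1, \mathcal{S}_2, \dots, \mathcal{S}_{i-1}$ are zero; also, since $w_{i_k} \in [2]$ for all $k \in \{1, 2, \dots, q'' + |\mathcal{S}_i|\}$, the unknown variables can be determined by finding the unique binary representation of $a$. Note that the coefficient of the variable $w_{l_k}$, $l < i$, is equal to

$$\sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is odd}}} C_l(j, k) - \sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is even}}} C_l(j, k).$$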

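We first show that $\forall\, l < i$, the coefficients of the variables corresponding to $\mathcal{S}_l$ in (38) are all zero. Although Lemma 3 cannot be directly applied to our problem (matrix $C$ is not binary), we can use this lemma for a part of the proof. Let $\mathcal{F} = \{\mathcal{S}_j\}_{j=1}^{m}$; this collection satisfies the condition of Lemma 3. Let $l < i$; due to the specific ordering of the members of $\mathcal{F}$, we have $\mathcal{S}_i \not\subseteq \mathcal{S}_l$, and we can set $C = \mathcal{S}_i$ and $S = \mathcal{S}_l$. Consider the $k$-th column of $C_l$, where $k \in \{q'' + 1, q'' + 2, \dots, q'' + |\mathcal{S}_l|\}$. For this column, let $g_{l,k}(\mathcal{S}_j) = C_l(j, k)$. Some inspection shows that $g_{l,k}(\mathcal{S}_j \cap \mathcal{S}_l) = g_{l,k}(\mathcal{S}_j)$ for every $j$, and $g_{l,k}(\mathcal{S}_j) \in \{0, 1\}$. Using Lemma 3, we conclude that

$$\sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is odd}}} g_{l,k}(\mathcal{S}_j) = \sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is even}}} g_{l,k}(\mathcal{S}_j). \tag{39}$$

Now consider the $k$-th column of $C_l$, where $k \in \{1, 2, \dots, q''\}$. For this column, let $g_{l,k}(\mathcal{S}_j) = C_l(j, k)$. This function does not satisfy the binary condition in Lemma 3. However, since $g_{l,k}(\mathcal{S}_j) = 2^{q'' - k + 1} g_{l,q''+1}(\mathcal{S}_j)$, using (39) one has

$$\begin{aligned}
\sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is odd}}} g_{l,k}(\mathcal{S}_j)
&= 2^{q'' - k + 1} \sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is odd}}} g_{l,q''+1}(\mathcal{S}_j) \\
&= 2^{q'' - k + 1} \sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is even}}} g_{l,q''+1}(\mathcal{S}_j) \\
&= \sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is even}}} g_{l,k}(\mathcal{S}_j).
\end{aligned}$$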



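Consequently, $\forall\, l < i$ and $k \in \{1, 2, \dots, q'' + |\mathcal{S}_l|\}$ one has

$$\sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is odd}}} C_l(j, k) - \sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is even}}} C_l(j, k) = 0. \tag{40}$$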


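To complete the proof, consider the $k$-th column of $C_i$, where $k \in \{1, 2, \dots, q'' + 1\}$. Since (38) is formed using the rows labeled by odd and even subsets of $\mathcal{S}_i$, the coefficient of $w_{i_k}$ is equal to

$$\sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is odd}}} C_i(j, k) - \sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is even}}} C_i(j, k) = 2^{q'' - k + 1} \cdot 2^{|\mathcal{S}_i| - 1} - 0 = 2^{q'' + |\mathcal{S}_i| - k}, \tag{41}$$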



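where $2^{|\mathcal{S}_i| - 1}$ is the number of odd subsets of $\mathcal{S}_i$. Now we consider the $k$-th column of $C_i$, where $k \in \{q'' + 2, q'' + 3, \dots, q'' + |\mathcal{S}_i|\}$. From the definition of $\gamma_{i,k}$ and its relationship to $\gamma_{i,k-1}$, it can be shown that the coefficient of the variable $w_{i_k}$ is equal to

$$\begin{aligned}
\sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is odd}}} C_i(j, k) - 0
&= \sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is odd}}} C_i(j, k) \\
&= \sum_{\substack{j:\, \mathcal{S}_j \subseteq \mathcal{S}_i \\ |\mathcal{S}_j| \text{ is odd}}} \mathbb{1}\left[ \{|\mathcal{S}_j \cap \gamma_{i,q''+2}| \text{ is odd}\} \cap \dots \cap \{|\mathcal{S}_j \cap \gamma_{i,k-1}| \text{ is odd}\} \cap \{|\mathcal{S}_j \cap \gamma_{i,k}| \text{ is odd}\} \right] \\
&= \dots = 2^{q'' + |\mathcal{S}_i| - k}.
\end{aligned} \tag{42}$$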



Using (40), (41), and (42), it can be seen that (38) is of the form

$$\sum_{k=1}^{q'' + |\mathcal{S}_i|} 2^{q'' + |\mathcal{S}_i| - k}\, w_{i_k} = a,$$

where $a$ depends on $y$ and the known variables corresponding to $\mathcal{S}_{i+1}, \mathcal{S}_{i+2}, \dots, \mathcal{S}_m$. This completes the proof of the claimed result.

Appendix B
Proof of Theorem 10



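Proof of Theorem 10: First, we prove that any rate $R < C$ is achievable. Since $C = \sup_{P_T,\, \eta} a(d, P_T, \eta)$, for every $\epsilon > 0$ there exist $P'_T$ and $\eta'$ such that $C - \epsilon < a(d, P'_T, \eta') \le C$. Let $\epsilon = C - R$ and $a' = a(d, P'_T, \eta')$; then there exists a test design with parameters $P'_T$ and $\eta'$ such that $R < a'$. Generate a random code of size $n$ and length $m$ according to $P'_T$ for a test with thresholds $\eta'$. Let $0 < \epsilon' < |a' - R|$. Then,

$$R + \epsilon' < a' \;\Rightarrow\; \frac{\log n}{m} + \epsilon' < a' \;\Rightarrow\; m > \frac{\log n + m\epsilon'}{a'}.$$

For any choice of $\epsilon'$ and sufficiently large values of $m$ and $n$, $m\epsilon' > \log(d e^2) + \log\left(1 - \frac{d}{n}\right)$ and therefore,

$$\begin{aligned}
m > \frac{\log n + m\epsilon'}{a'}
&> \frac{\log(d e^2) + \log\left(1 - \frac{d}{n}\right) + \log n}{a'}
= \frac{1}{a'} \log\big( (n-d)e \cdot de \big) \\
&\ge \max_{i} \frac{\log \left( \binom{n-d}{i} \binom{d}{i} \right)}{i\, a'}
\ge \max_{i:\, (\mathcal{D}_1^{(i)}, \mathcal{D}_2^{(i)}) \in \mathcal{A}^{(i)}} \frac{\log \left( \binom{n-d}{i} \binom{d}{i} \right)}{I(t_{\mathcal{D}_1^{(i)}}; t_{\mathcal{D}_2^{(i)}}, y)}.
\end{aligned}$$

Using (35), these inequalities imply that the average probability of error converges to zero as $m, n \to \infty$.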

Conversely, if the average probability of error converges to zero, then for any $i \in \{1, 2, \dots, d\}$ one has

$$m \ge \frac{\log \binom{n-d+i}{i}}{I(t_{\mathcal{D}_1^{(i)}}; t_{\mathcal{D}_2^{(i)}}, y)} \;\Rightarrow\; \frac{I(t_{\mathcal{D}_1^{(i)}}; t_{\mathcal{D}_2^{(i)}}, y)}{i} \ge \frac{\log \binom{n-d+i}{i}}{i\, m} \ge \frac{\log \left( \frac{n-d+i}{i} \right)}{m}.$$

Consequently,

$$a \ge \min_{i} \frac{\log \left( \frac{n-d+i}{i} \right)}{m} = \frac{\log \left( \frac{n}{d} \right)}{m} = R - \frac{\log d}{m}, \tag{43}$$

which in the asymptotic regime simplifies to $R \le a$. As a result, the inequality $R \le \sup_{P_T,\, \eta} a$ holds in the asymptotic regime and therefore $R \le C$.